Using BLIS Building Blocks:


SI2-SSI: Sustaining Innovation in the Linear Algebra Software Stack for Computational Chemistry and other Sciences

Don Batory, Victor Eijkhout, Devin Matthews, Maggie Myers, John Stanton, Robert van de Geijn, Field Van Zee

Department of Computer Science, Department of Statistics and Data Sciences, Institute for Computational Engineering and Sciences, Department of Chemistry and Biochemistry, Texas Advanced Computing Center
The University of Texas at Austin
shpc.ices.utexas.edu
The Science of High-Performance Computing Group

Scientific applications such as quantum chemistry are enabled by efficient computational kernels, especially from linear algebra. The BLAS-like Library Instantiation Software (BLIS) framework and the libflame library with LAPACK functionality, both developed as part of this grant, not only rival the efficiency of proprietary vendor libraries but also have the potential to support scientific applications in novel ways. The project incorporates innovative pedagogical outreach, ranging from undergraduate research involvement to the development of Massive Open Online Courses offered through edX.

The Dense Linear Algebra Software Stack

Open Source Software

BLAS-like Library Instantiation Software (BLIS): This portable framework for rapidly instantiating BLAS-like libraries is a refactoring of Kazushige Goto's approach to the implementation of the level-3 BLAS. Importantly, BLIS casts virtually all level-3 computation in terms of a single "micro-kernel," and also serves as a productivity lever when supporting higher-level dense linear algebra (DLA) functionality. Level-2 operations were also restructured, exposing key so-called level-1v (vector) and level-1f (fused) kernel operations.

Software developed by this project is being distributed by AMD, ARM, Texas Instruments, and Movidius. It is also at the core of research collaborations with Intel and HP.

[Figure: BLIS vs. MKL on the Intel Xeon Phi]
[Figure: BLIS vs. ESSL on the IBM PowerPC A2]
[Figure panel labels: 60 cores; 16 cores]

libflame: This general-purpose DLA library is built atop existing BLAS APIs and provides an object-based environment in which to build custom linear algebra operations and application solutions. It natively provides much of the functionality commonly used in LAPACK, and provides full backward compatibility with existing LAPACK libraries, even where native functionality is not yet present.

Recent publications:
F. G. Van Zee, R. A. van de Geijn. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM TOMS, 2015.
F. G. Van Zee, T. Smith, B. Marker, T. M. Low, R. A. van de Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, L. Killough. The BLIS Framework: Experiments in Portability. ACM TOMS, 2016.
T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, F. G. Van Zee. Anatomy of High-Performance Many-Threaded Matrix Multiplication. IPDPS, 2014.
T. M. Low, F. D. Igual, T. M. Smith, E. S. Quintana-Orti. Analytical Modeling is Enough for High Performance BLIS. ACM TOMS, 2016.

Using BLIS Building Blocks:

Machine Learning Primitives

Recent publications:
Chenhan D. Yu*, William March, George Biros. An NlogN Parallel Fast Direct Solver for Kernel Matrices. IPDPS'17.
Chenhan D. Yu*, William March, Bo Xiao, George Biros. Inv-Askit: A Parallel Fast Direct Solver for Kernel Matrices. IPDPS'16.
Chenhan D. Yu*, Jianyu Huang, Woody Austin, Bo Xiao, George Biros. Performance Optimization for the K-Nearest Neighbors Kernel on x86 Architectures. SC'15.

Strassen's Algorithm and Fast MM

Recent publications:
J. Huang, T. M. Smith, G. M. Henry, R. A. van de Geijn. Strassen's Algorithm Reloaded. SC16.
J. Huang, L. Rice, D. A. Matthews, R. A. van de Geijn. Generating Families of Practical Fast Matrix Multiplication Algorithms. IPDPS, 2017, accepted.
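To make the two ideas above concrete (level-3 computation cast in terms of a single micro-kernel, and Strassen's algorithm layered on top of it), here is a minimal Python sketch. It is an illustration only, not the BLIS API and not the implementation from the cited papers: the names micro_kernel, gemm, and strassen_one_level and the tile sizes MR and NR are hypothetical, packing and cache blocking are omitted, and the Strassen sums are formed as explicit temporaries.

```python
MR, NR = 2, 2  # hypothetical micro-tile sizes, chosen only for illustration


def micro_kernel(A, B, C, i0, j0, k_len):
    """Update the (at most) MR x NR tile of C at (i0, j0) by k_len rank-1 updates."""
    m_end = min(i0 + MR, len(C))
    n_end = min(j0 + NR, len(C[0]))
    for p in range(k_len):
        for i in range(i0, m_end):
            a_ip = A[i][p]
            for j in range(j0, n_end):
                C[i][j] += a_ip * B[p][j]


def gemm(A, B):
    """Return A times B; every multiply-add happens inside micro_kernel."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, NR):
        for i0 in range(0, m, MR):
            micro_kernel(A, B, C, i0, j0, k)
    return C


def add(X, Y, sign=1.0):
    """Elementwise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Split a square matrix of even order into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen_one_level(A, B):
    """One level of Strassen: seven half-size products instead of eight."""
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = gemm(add(A11, A22), add(B11, B22))
    M2 = gemm(add(A21, A22), B11)
    M3 = gemm(A11, add(B12, B22, -1.0))
    M4 = gemm(A22, add(B21, B11, -1.0))
    M5 = gemm(add(A11, A12), B22)
    M6 = gemm(add(A21, A11, -1.0), add(B11, B12))
    M7 = gemm(add(A12, A22, -1.0), add(B21, B22))
    C11 = add(add(M1, M4), add(M7, M5, -1.0))  # M1 + M4 - M5 + M7
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(M1, M2, -1.0), add(M3, M6))  # M1 - M2 + M3 + M6
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

For square inputs of even order, strassen_one_level(A, B) and gemm(A, B) should agree up to floating-point rounding. The point of the layering is that only micro_kernel would need to be tuned for a given architecture; everything above it is ordinary loop and bookkeeping code.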
Training the Next Generation of Computational Scientists

Stand-alone materials based on the edX course. Introductory linear algebra for computer science and computational science students. Uses notation and APIs from the FLAME project.

~900 pages of notes
~270 videos
web-based activities
IPython notebooks (Spring 2014)
MATLAB (Spring 2015, Summer 2015)
"pick-your-price" (or get it free)

Offered Spring 2014, Spring 2015, Summer 2015, Fall 2016 and Spring 2017
100,000+ registered participants
Starts April 2017

TBLIS: Dense and Hierarchical Tensor Contractions for Quantum Chemistry

[Figure: "Square" MM and TC on Xeon Phi 7210]

Recent publications:
D. A. Matthews. High-Performance Tensor Contraction without Transposition. SIAM SISC, in review.
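
As a rough illustration of what "tensor contraction without transposition" can mean, the following Python sketch contracts two dense tensors by reading the operands in place through their strides, instead of first permuting them into matrix layout and calling a matrix multiplication. It shows only the indexing idea under simplifying assumptions (row-major flat lists, one fixed index pattern); the function names strides and contract are hypothetical and this is not the TBLIS interface or its algorithm.

```python
from itertools import product


def strides(shape):
    """Row-major strides for a dense tensor of the given shape."""
    out, step = [], 1
    for extent in reversed(shape):
        out.append(step)
        step *= extent
    return out[::-1]


def contract(A, a_shape, B, b_shape):
    """C[i, j] = sum over k, l of A[i, k, l] * B[l, j, k], on flat row-major lists.

    B is stored with its indices in a different order than A uses them;
    the strides absorb that difference, so no transposed copy is made.
    """
    ni, nk, nl = a_shape
    nl_b, nj, nk_b = b_shape
    assert (nk, nl) == (nk_b, nl_b), "contracted extents must match"
    sa, sb = strides(a_shape), strides(b_shape)
    C = [0.0] * (ni * nj)
    for i, j in product(range(ni), range(nj)):
        acc = 0.0
        for k, l in product(range(nk), range(nl)):
            a = A[i * sa[0] + k * sa[1] + l * sa[2]]
            b = B[l * sb[0] + j * sb[1] + k * sb[2]]
            acc += a * b
        C[i * nj + j] = acc
    return C
```

Grouping (k, l) into a single summation index turns this loop nest into an ordinary matrix multiplication, which is why matrix-multiplication building blocks are relevant to tensor contraction.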