Generic Compressed Matrix Insertion
Peter Gottschling – SmartSoft / TUD
Dag Lindbo – Kungliga Tekniska Högskolan
SmartSoft – TU Dresden


Overview
- Software libraries: MTL4, FEniCS
- Compressed sparse matrices
- Insertion
- Benchmarks
- Vision

Matrix Template Library 4
- Generic library for high-performance numeric operations in mathematical notation
- Many new techniques, such as implicit enable-if and meta-tuning
- Most modern iterative solvers
- Focus on high-performance simulation: FEM/XFEM/FVM/FDM
- Commercial version in preparation
- Parallel version in progress
- Multi-core, GPU support and multigrid in the near future

Linear Equation Solver
Innovative product development with the finite element method (FEM). The conjugate gradient solver in MTL4:

    template <class LinearOperator, class HilbertSpaceX, class HilbertSpaceB,
              class Preconditioner, class Iteration>
    int cg(const LinearOperator& A, HilbertSpaceX& x, const HilbertSpaceB& b,
           const Preconditioner& M, Iteration& iter)
    {
        typedef typename mtl::Collection<HilbertSpaceX>::value_type Scalar;
        Scalar rho, rho_1, alpha, beta;
        HilbertSpaceX p(size(x)), q(size(x)), r(size(x)), z(size(x));

        r = b - A * x;
        while (!iter.finished(r)) {
            z = solve(M, r);
            rho = dot(r, z);
            if (iter.first())
                p = z;
            else {
                beta = rho / rho_1;
                p = z + beta * p;
            }
            q = A * p;
            alpha = rho / dot(p, q);
            x += alpha * p;
            r -= alpha * q;
            rho_1 = rho;
            ++iter;
        }
        return iter;
    }

FEniCS
- Free software for solving differential equations
- FFC – FEniCS Form Compiler: high-level mathematical language for formulating differential equations; generates C++ code
- DOLFIN – generic FEM kernel: C++ library for the FEM core: assembler, mesh and function abstraction
- Interfaces to uBLAS, PETSc, Trilinos, and MTL4
- This paper focuses on matrix assembly

Compressed Sparse Row Format
- Most common general-purpose sparse format
- Entries sorted within each row
- A kind of run-length encoding on the rows
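To make the layout concrete, here is a minimal, self-contained C++ sketch of the three CRS arrays for a small example matrix (the matrix and its values are illustrative, not taken from the talk):

    #include <iostream>
    #include <vector>

    // CRS stores a sparse matrix in three arrays: row starts, column
    // indices, and values. Example for the 3x4 matrix
    //   [1 0 2 0]
    //   [0 0 3 0]
    //   [4 5 0 0]
    int main()
    {
        std::vector<int>    starts  = {0, 2, 3, 5};    // row i occupies [starts[i], starts[i+1])
        std::vector<int>    indices = {0, 2, 2, 0, 1}; // column of each entry, sorted per row
        std::vector<double> values  = {1, 2, 3, 4, 5}; // the non-zero values

        for (std::size_t r = 0; r + 1 < starts.size(); ++r)
            for (int k = starts[r]; k < starts[r + 1]; ++k)
                std::cout << "A(" << r << "," << indices[k] << ") = " << values[k] << '\n';
        return 0;
    }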

In-Flight Insertion (A[0][1] = 6;)
- Very simple use, like dense matrices
- Simple realization
- Extremely expensive: all following entries are moved
- Quadratic complexity
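The cost is easy to see in a hypothetical sketch over raw CRS arrays (an illustrative helper, not MTL4's API): every insertion shifts all following elements of the index and value arrays and bumps every later row start, so assembling n entries this way costs O(n · nnz):

    #include <algorithm>
    #include <vector>

    // In-flight insertion into raw CRS arrays; assumes 'col' is not yet
    // present in 'row'. One new entry shifts everything behind it.
    void insert_inflight(std::vector<int>& starts, std::vector<int>& indices,
                         std::vector<double>& values, int row, int col, double v)
    {
        // find the sorted position of 'col' within 'row'
        auto pos = std::lower_bound(indices.begin() + starts[row],
                                    indices.begin() + starts[row + 1], col);
        std::size_t k = pos - indices.begin();

        indices.insert(indices.begin() + k, col); // O(nnz) shift
        values.insert(values.begin() + k, v);     // O(nnz) shift
        for (std::size_t r = row + 1; r < starts.size(); ++r)
            ++starts[r];                          // all later rows move back
    }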

Two-Phase Insertion
- Dedicated insertion phase; the matrix is available only after the insertion is terminated
- Later modification impossible
- Works for distributed matrices as well
- Used in PETSc; includes construction of the communication buffers for distributed SpMVP
- Janus derives its name from this (the god with two faces)
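In PETSc, for example, the two phases are explicit: entries are staged with MatSetValue/MatSetValues, and the matrix becomes usable only after assembly. A minimal sketch against PETSc's C API (error checking omitted):

    #include <petscmat.h>

    int main(int argc, char** argv)
    {
        Mat A;
        PetscInitialize(&argc, &argv, NULL, NULL);

        MatCreate(PETSC_COMM_WORLD, &A);
        MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 3, 5);
        MatSetUp(A);

        /* Phase 1: stage entries; the matrix cannot be used yet. */
        MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
        MatSetValue(A, 2, 4, 5.0, INSERT_VALUES);

        /* Phase 2: assembly; afterwards A is usable, but further
           structural modification is not possible. */
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        MatDestroy(&A);
        PetscFinalize();
        return 0;
    }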

Inserter Concept in MTL4
- Inserter = an object providing operations to set up other objects, e.g. matrices or vectors, efficiently
- The insertion phase lasts as long as the inserter lives
- Insert within a scope (block, function); the matrix is ready when the inserter is destroyed
- Later insertion is possible with another inserter
- Extends to distributed matrices and vectors
- MTL4 inserters have minimal memory usage

Using Inserters

    #include <iostream>
    #include <boost/numeric/mtl/mtl.hpp>

    int main(int argc, char* argv[])
    {
        mtl::compressed2D<double> A(3, 5);
        {
            mtl::matrix::inserter<mtl::compressed2D<double> > ins(A);
            ins[0][0] << 1.0; ins[0][2] << 2.0; ins[1][3] << 3.0;
            ins[2][1] << 4.0; ins[2][4] << 5.0;
        }   // inserter destroyed here: A is now ready
        std::cout << "A is\n" << A << '\n';
        return 0;
    }

Direct Insertion (A[0][1] = 6;)
- Reserve s entries per row
- Find the insert position by linear or binary search
- Move the remainder of the row
- Linear in s, which is constant
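A hypothetical sketch of the idea (illustrative, not MTL4's internal code): each row owns a block of s slots, so the search and shift never leave the row and each insertion costs O(s), i.e. constant for fixed s:

    #include <algorithm>
    #include <vector>

    // One over-allocated row of the insertion buffer; 'slack' is the
    // reserved number of entries per row (the 's' above).
    struct RowBlock {
        std::vector<int>    cols;  // sorted column indices
        std::vector<double> vals;
        std::size_t         slack;

        explicit RowBlock(std::size_t s) : slack(s) { cols.reserve(s); vals.reserve(s); }

        // returns false if the row is saturated (handled indirectly, see below)
        bool insert(int col, double v)
        {
            auto pos = std::lower_bound(cols.begin(), cols.end(), col);
            if (pos != cols.end() && *pos == col) {  // entry exists: accumulate
                vals[pos - cols.begin()] += v;
                return true;
            }
            if (cols.size() == slack)                // row full: spill over
                return false;
            vals.insert(vals.begin() + (pos - cols.begin()), v);
            cols.insert(pos, col);                   // shifts at most s entries
            return true;
        }
    };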

Indirect Insertion (A[0][4] = 7;)
- For saturated rows, use a "spare" container: a std::map keyed by the index pair
- Logarithmic in the number of spare entries
- Additional allocation
- About 10 times slower than direct insertion
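A sketch of the spare container under the same illustrative assumptions (names are hypothetical): overflow entries are keyed by their (row, column) pair in a std::map and merged into the compressed format when the inserter is destroyed:

    #include <map>
    #include <utility>

    // Spare container for saturated rows: keyed by the (row, column) index
    // pair; lookup/insertion is logarithmic in the number of spare entries
    // and each node is allocated separately, hence the ~10x slowdown.
    std::map<std::pair<int, int>, double> spare;

    void insert_indirect(int row, int col, double v)
    {
        spare[std::make_pair(row, col)] += v;  // operator[] default-inserts 0.0
    }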

Benchmark
- Assemble a CRS matrix; row order is important, as is the order within each row
- Performance measure: number of non-zeros inserted per second
- Reassembly
- Three libraries: uBLAS (including its vector-of-vector format, labeled "gov" in the results below), MTL4, PETSc
- Ordinary workstation (Intel)
- All benchmarks run through a simple interface routine for each library, e.g.:

    template <class Matrix>
    void insert_row(Matrix& A, int row_idx, int* cols_idx, double* a, int n)
    {
        for (int j = 0; j < n; j++)
            A(row_idx, cols_idx[j]) += a[j];
    }

Benchmark: Assembly rate with ascending rows
10,000 rows, 5 non-zeros/row
- MTL4: 46 million entries per second
- uBLAS: 5.9 million entries per second
- uBLAS (gov): 2 million entries per second
- PETSc: 22 million entries per second

Benchmark: Assembly rate with ascending rows
100,000 rows, 50 non-zeros/row
- MTL4: 29.6 million entries per second
- uBLAS: 6.5 million entries per second
- uBLAS (gov): 2.8 million entries per second
- PETSc: 32.3 million entries per second

Benchmark: Assembly rate with random rows
10,000 rows, 5 non-zeros/row
- MTL4: 41.4 million entries per second
- uBLAS: 31,300 entries per second
- uBLAS (gov): 1.9 million entries per second
- PETSc: 19.9 million entries per second

Benchmark: Assembly rate with random rows
100,000 rows, 50 non-zeros/row
- MTL4: 25.6 million entries per second
- uBLAS: measurement abandoned
- uBLAS (gov): 2.7 million entries per second
- PETSc: 25.6 million entries per second

Benchmark: Assembly rate with entirely random entries
10,000 rows, 5 non-zeros/row
- MTL4: 4.8 million entries per second
- uBLAS: 16,700 entries per second
- uBLAS (gov): 1.8 million entries per second
- PETSc: 15,900 entries per second

Benchmark: Assembly rate with entirely random entries
10,000 rows, 50 non-zeros/row
- MTL4: 2.9 million entries per second
- uBLAS: 3,340 entries per second
- uBLAS (gov): 1.7 million entries per second
- PETSc: 13,400 entries per second

How to do Science in Silicon?
(diagram: a graphics application running on CPU and GPU)

Scientific Software
(diagram: a scientific application targeting CPU, GPU, multi-core, parallel architectures, and scientific processors)

Conclusions
- Introduced a new approach for setting and modifying compressed sparse matrices
- No preparation phase needed
- Minimal memory footprint
- Optimal performance
- Tuned block insertion in progress
- Extends to distributed data structures