Introducing Tpetra and Kokkos
Chris Baker / ORNL
TUG 2009, November, CSRI

Slide 2: Introducing Tpetra and Kokkos
Managed by UT-Battelle for the U.S. Department of Energy
Tpetra provides a next-generation implementation of the Petra Object Model.
– The Petra Object Model is a framework for distributed linear algebra objects.
– Tpetra is the successor to Epetra.
Kokkos is an API for programming to a generic parallel node.
– The Kokkos memory model allows code to be targeted at traditional (CPU) and non-traditional (accelerator) nodes.
– The Kokkos compute model provides a set of constructs for parallel computing operations.

Slide 3: Tpetra Organization
Tpetra follows the Petra Object Model currently implemented in Epetra:
– Map describes the distribution of object data across nodes.
– Teuchos::Comm abstracts internode communication.
– The Import, Export, and Distributor utility classes facilitate efficient data transfer.
– Operator, RowMatrix, and RowGraph provide abstract interfaces.
– Vector, MultiVector, CrsGraph, and CrsMatrix are the concrete implementations that serve as the workhorses of Tpetra-centered codes.
Any class with significant data is templated. Any class with significant computation uses Kokkos.

Slide 4: Tpetra vs. Epetra
Most of the functionality of Epetra is present in Tpetra, but some differences prohibit a "find-and-replace" migration.

Epetra:
  Epetra_MpiComm comm(...);
  Epetra_Map map(numGlobal, 0, comm);
  Epetra_CrsMatrix A( Copy, map, &nnz, true );
  Epetra_Vector x(map), y(map);
  A.Apply(x, y);

Tpetra (shown with representative template arguments):
  RCP<const Comm<int> > comm = rcp(...);
  Map<int,int> map(numGlobal, 0, comm);
  CrsMatrix<double,int,int> A( rcpFromRef(map), nnz, StaticProfile );
  Vector<double,int,int> x( rcpFromRef(map) ), y( rcpFromRef(map) );
  A.apply(x, y);

The differences include:
– minor interface changes,
– a dependency on the Kokkos package,
– the introduction of templated classes.

Slide 5: Tpetra Templated Classes
A limitation of Epetra is that its implementation is tied to double and int.
– Wide deployment of Epetra discourages significant modifications.
– The published interface limits the possible implementation changes.
A clean slate and current compiler availability allow Tpetra to address this via template parameters on its classes. This provides numerous capability extensions:
– No 4 GB limit: going beyond int ordinals enables arbitrarily large problems.
– Arbitrary scalar types: float, complex, matrix types, qd_real.
– Greater efficiency.

Slide 6: Tpetra Basic Template Parameters
There are three primary template arguments: LocalOrdinal, GlobalOrdinal, and Scalar.
Scalar enables the description of numerical objects over different fields.
– Any mathematically well-defined type is supported.
– Additionally, the type must be supported by Teuchos::ScalarTraits and Teuchos::SerializationTraits.
LocalOrdinal describes local element indices.
– Intended to enable efficiency; it should be chosen as small as possible.
GlobalOrdinal describes global element indices.
– Intended to enable larger problem sizes.
– Decoupling the two ordinal types is necessary when the number of nodes is large.

Slide 7: Tpetra Template Examples
Map:
  global_size_t getGlobalNumElements();
  size_t getNodeNumElements();
  LocalOrdinal getLocalElement(GlobalOrdinal gid);
  GlobalOrdinal getGlobalElement(LocalOrdinal lid);
CrsMatrix:
  global_size_t getGlobalNumEntries();
  size_t getNodeNumEntries();
  void getGlobalRowView(GlobalOrdinal gid, ArrayRCP<const GlobalOrdinal> &inds, ArrayRCP<const Scalar> &vals);

Slide 8: Tpetra Advanced Template Parameters
Other template arguments exist to provide additional flexibility in the implementation of Tpetra objects:
– The Node template argument specifies a Kokkos node.
– Local data structures and implementations are also flexible.
Example: CrsMatrix
– Scalar: field for the matrix values.
– LO (default: int): type of local indices.
– GO (default: LO): type of global indices.
– Node (default: Kokkos::DefaultNode): Kokkos node for local operations.
– LclMatVec (default: Kokkos::DefaultSparseMultiply): implementation of the local sparse mat-vec.
– LclMatSolve (default: Kokkos::DefaultSparseSolve): implementation of the local sparse solve.

Slide 9: Kokkos Parallel Node API
Want: minimize the effort needed to port Tpetra. The goal of Kokkos is to allow code, once written, to be run on any parallel node, regardless of architecture. The difficulties are many.
Difficulty #1: many different memory architectures.
– A node may have multiple, disjoint memory spaces.
– Optimal performance may require special memory placement.
Difficulty #2: kernels must be tailored to the architecture.
– The implementation of an optimal kernel will vary between architectures.
– There is no universal binary, hence the need for separate compilation paths.

Slide 10: Kokkos Node API
Kokkos provides two components:
– The Kokkos memory model addresses Difficulty #1:
  allocation, deallocation, and efficient access of memory;
  compute buffers: special memory allocations used exclusively for parallel computation.
– The Kokkos compute model addresses Difficulty #2:
  description of kernels for parallel execution on a node;
  stubs for common parallel work constructs: the parallel for loop and the parallel reduction.
Supporting a new platform is only a matter of implementing these models, i.e., implementing a new Node object.

Slide 11: Kokkos Memory Model
A generic node model must at least:
– support the scenario involving distinct memory regions, and
– allow efficient memory access under traditional scenarios.
Node provides the following memory-handling routines:
  ArrayRCP<T> Node::allocBuffer(size_t sz);
  void Node::copyToBuffer(ArrayView<const T> src, ArrayRCP<T> dest);
  void Node::copyFromBuffer(ArrayRCP<const T> src, ArrayView<T> dest);
  ArrayRCP<T> Node::viewBuffer(ArrayRCP<T> buff);
  void Node::readyBuffer(ArrayRCP<T> buff);

Slide 12: Kokkos Compute Model
We have to find the correct level for programming the node.
– Too low: code dot(x,y) for each node.
  Too much work to move to a new platform; the effort of writing dot() duplicates that of norm1().
– Too high: code dot(x,y) once for all nodes.
  Can't exploit hardware features; the API becomes a programming language without a compiler.
Somewhere in the middle:
– A parallel reduction is the intersection of dot() and norm1().
– A parallel for loop is the intersection of axpy() and mat-vec.
– We need a way of fusing kernels with these basic constructs.
The payoff: m kernels × n nodes = m·n implementations, versus m kernels + 2 constructs × n nodes = m + 2n implementations.

Slide 13: Kokkos Compute Model (continued)
Template meta-programming is the answer.
– This is the same approach that Intel TBB takes.
Node provides the generic parallel constructs Node::parallel_for and Node::parallel_reduce; the user fills the holes in the generic construct.

  template <class WDP>
  void Node::parallel_for(int beg, int end, WDP workdata);

  template <class WDP>
  typename WDP::ReductionType
  Node::parallel_reduce(int beg, int end, WDP workdata);

  template <class T>
  struct AxpyOp {
    const T * x;
    T * y;
    T alpha, beta;
    void execute(int i) { y[i] = alpha*x[i] + beta*y[i]; }
  };

  template <class T>
  struct DotOp {
    typedef T ReductionType;
    const T * x, * y;
    T generate(int i) { return x[i]*y[i]; }
    T reduce(T x, T y) { return x + y; }
  };

Slide 14: Nodes and Kernels: How It Comes Together
The Kokkos developer, vendor, or hero develops nodes: TBBNode, TPINode, RoadRunnerNode, CUDANode, SerialNode, YourNodeHere.
The user develops kernels for the parallel constructs. Template meta-programming does the rest:
– TBBNode::parallel_reduce< DotOp<double> >
– CUDANode::parallel_for< AxpyOp<float> >
Composition happens at compile time:
– OpenMPNode + AxpyOp is equivalent to a hand-coded OpenMP axpy.
– It may not always be possible to achieve this feat.

Slide 15: Kokkos Linear Algebra Library
A subpackage of Kokkos providing a set of data structures and kernels for local parallel linear algebra objects, coded to the Kokkos Parallel Node API. Tpetra (global) objects consist of a Comm and a corresponding (local) Kokkos object, so implementing a new Node ports Tpetra without any changes to Tpetra itself.

  T Tpetra::Vector<T>::dot(const Tpetra::Vector<T> &v) {
    T lcl = this->lclVec_->dot( v.lclVec_ );
    return comm_->reduceAll(SUM, lcl);
  }

Slide 16: Teuchos Memory Management Suite: A User Perspective
Chris Baker / ORNL
TUG 2009, November, CSRI

Slide 17: Teuchos Memory Management
The Teuchos utility package provides a number of memory management classes:
– RCP: a reference-counted pointer.
– ArrayRCP: a reference-counted array.
– ArrayView: encapsulates the length of, and a pointer to, an array.
– Array: a dynamically sized array.
Tpetra and Kokkos use these classes in place of raw pointers for:
– writing bug-free code;
– writing simple code with simple interfaces.

Slide 18: Teuchos::RCP
RCP is a reference-counted smart pointer.
– Provides runtime protection against null dereference.
– Provides automatic garbage collection.
– Necessary in the context of exceptions.
Its semantics are those of a C pointer.
Tpetra use:
– tracking the ownership of dynamically created objects;
– Tpetra::Map objects are always passed by RCP;
– dynamically created objects are always encapsulated in an RCP:
  RCP<Vector> Vector::getSubView(...);
Non-persisting situations allow the more efficient Teuchos::Ptr.

Slide 19: Teuchos::ArrayRCP
ArrayRCP is a reference-counted smart array.
– In C, T* does double duty: pointer to an object and pointer to an array.
– RCP handles the former; ArrayRCP handles the latter.
Its semantics are those of a C array/pointer:
– access operators: [], *, ->;
– arithmetic operators: ++, --, +=, -=;
– all operations are bounds-checked in a debug build;
– iterators are available for optimal release-build performance.
Tpetra/Kokkos use:
– allocated arrays are always encapsulated in an ArrayRCP before being returned;
– used heavily in Kokkos for compute buffers and their views.

Slide 20: Example: ArrayRCP and Kokkos Buffers
  ArrayRCP<T> Node::allocBuffer(size_t sz);
The use of Teuchos::ArrayRCP greatly simplifies the management of compute buffers in the Kokkos memory model. In the absence of a smart pointer, the Node would need to provide a deleteBuffer() method as well:
– It would need to be called manually by the user.
– This requires the ability to identify when the buffer can be freed.
– ArrayRCP instead allows the Node to register a custom, Node-appropriate deallocator and additional bookkeeping data.

Slide 21: Example: ArrayRCP and Kokkos Buffers (continued)
  ArrayRCP<T> Node::viewBuffer(ArrayRCP<T> buff);
– In the absence of ArrayRCP, this method would require the user to "release" the view to enable any necessary write-back to device memory, which in turn requires manually tracking when the view has expired. Instead, the Node can register a custom deallocator for the ArrayRCP that performs the write-back or other necessary bookkeeping.
– This is especially helpful in the context of Tpetra: Tpetra::MultiVector::get1dView() returns a host view of class data encapsulated in an ArrayRCP with the appropriate deallocator. As a result, the Tpetra user is never exposed to the Kokkos Node and does not have to release the view manually.

Slide 22: Teuchos::ArrayView
RCP is sometimes overkill; non-persisting relationships can get away with Ptr. Non-persisting relationships involving array data similarly use the ArrayView class.
– This class essentially encapsulates a pointer and a size.
– It supports a subset of the C array semantics.
An optimized build results in very fast code:
– no garbage-collection overhead;
– iterators become C pointers.
It is well integrated with the other classes:
– easily returned by ArrayRCP and Array.

Slide 23: Teuchos::Array
Array is a replacement for std::vector. The benefit of Array is its integration with the other Teuchos memory classes.

With std::vector and raw pointers:
  vector<int> data(...);
  int * myalloc = NULL;
  myalloc = func2( &data[offset], size );

  int * func2(int A[], int length) {
    int sum = accumulate( A, A+length, 0 );
    return new int[sum];
  }

With the Teuchos classes:
  Array<int> data(...);
  ArrayRCP<int> myalloc;
  myalloc = func2( data(offset, size) );

  ArrayRCP<int> func2(ArrayView<int> A) {
    int sum = accumulate( A.begin(), A.end(), 0 );
    return arcp<int>(sum);
  }

Slide 24: Benefits of Use
The initial release of Tpetra contains no raw pointers:
– they are replaced by RCP, ArrayRCP, or an appropriate iterator;
– there is zero memory overhead with respect to Epetra;
– it almost made me a lazier developer.
The debugging abilities are excellent:
– they extend beyond normal bounds checking; additional constraints can be placed on memory access;
– an optimized (release) build results in code that is as fast as C.
These memory utilities are unique to Trilinos:
– research-level capability;
– production-level quality.