Introducing Tpetra and Kokkos
Chris Baker / ORNL
TUG 2009, November, CSRI
Managed by UT-Battelle for the U.S. Department of Energy

Introducing Tpetra and Kokkos
Tpetra provides a next-generation implementation of the Petra Object Model.
– This is a framework for distributed linear algebra objects.
– Tpetra is a successor to Epetra.
Kokkos is an API for programming to a generic parallel node.
– The Kokkos memory model allows code to be targeted to traditional (CPU) and non-traditional (accelerator) nodes.
– The Kokkos computational model provides a set of constructs for parallel computing operations.
Tpetra Organization
Tpetra follows the Petra Object Model currently implemented in Epetra:
– Map describes the distribution of object data across nodes.
– Teuchos::Comm abstracts internode communication.
– Import, Export, Distributor utility classes facilitate efficient data transfer.
– Operator, RowMatrix, RowGraph provide abstract interfaces.
– Vector, MultiVector, CrsGraph, CrsMatrix are concrete implementations that are the workhorses of Tpetra-centered codes.
Any class with significant data is templated.
Any class with significant computation uses Kokkos.
Tpetra vs. Epetra
Most of the functionality of Epetra is present in Tpetra. Some differences prohibit a “find-replace” migration:

Epetra:
  Epetra_MpiComm comm(...);
  Epetra_Map map(numGlobal, 0, comm);
  Epetra_CrsMatrix A( Copy, map, &nnz, true );
  Epetra_Vector x(map), y(map);
  A.Apply(x,y);

Tpetra:
  RCP<const Comm<int> > comm = rcp(...);
  Map<int> map(numGlobal, 0, comm);
  CrsMatrix<double,int> A( rcpFromRef(map), nnz, StaticProfile );
  Vector<double,int> x( rcpFromRef(map) ), y( rcpFromRef(map) );
  A.apply(x,y);

- Minor interface changes
- Dependency on the Kokkos package
- Introduction of templated classes
Tpetra Templated Classes
A limitation of Epetra is that the implementation is tied to double and int.
– Wide deployment of Epetra discourages significant modifications.
– The published interface limits the possible implementation changes.
A clean slate and improved compiler availability allow Tpetra to address this via template parameters to its classes. This provides numerous capability extensions:
– No 4GB limit: ordinals wider than int enable arbitrarily large problems.
– Arbitrary scalar types: float, complex, matrix, qd_real
– Greater efficiency.
Tpetra Basic Template Parameters
Three primary template arguments:
– LocalOrdinal, GlobalOrdinal, Scalar
Scalar enables the description of numerical objects over different fields.
– Any mathematically well-defined type is supported.
– The type must additionally be supported by Teuchos::ScalarTraits and Teuchos::SerializationTraits.
LocalOrdinal describes local element indices.
– Intended to enable efficiency; should be chosen as small as possible.
GlobalOrdinal describes global element indices.
– Intended to enable larger problem sizes.
– Decoupling it from LocalOrdinal is necessary when the number of nodes is large.
Tpetra Template Examples
Map:
  global_size_t getGlobalNumElements()
  size_t getNodeNumElements()
  LocalOrdinal getLocalElement(GlobalOrdinal gid)
  GlobalOrdinal getGlobalElement(LocalOrdinal lid)
CrsMatrix:
  global_size_t getGlobalNumEntries()
  size_t getNodeNumEntries()
  void getGlobalRowView(GlobalOrdinal gid, ArrayRCP<const GlobalOrdinal> &inds, ArrayRCP<const Scalar> &vals)
Tpetra Advanced Template Parameters
Other template arguments exist to provide additional flexibility in the Tpetra object implementation:
– The Node template argument specifies a Kokkos node.
– Local data structures and implementations are also flexible.
Example: CrsMatrix
– Scalar: field for matrix values
– LO (default int): type of local indices
– GO (default LO): type of global indices
– Node (default Kokkos::DefaultNode): Kokkos node for local operations
– LclMatVec (default Kokkos::DefaultSparseMultiply): implementation of local sparse mat-vec
– LclMatSolve (default Kokkos::DefaultSparseSolve): implementation of local sparse solve
Kokkos Parallel Node API
Want: minimize the effort needed to port Tpetra.
The goal of Kokkos is to allow code, once written, to be run on any parallel node, regardless of architecture.
The difficulties are many.
Difficulty #1: Many different memory architectures
– A node may have multiple, disjoint memory spaces.
– Optimal performance may require special memory placement.
Difficulty #2: Kernels must be tailored to the architecture
– The implementation of an optimal kernel will vary between architectures.
– No universal binary; there is a need for separate compilation paths.
Kokkos Node API
Kokkos provides two components:
– The Kokkos memory model addresses Difficulty #1
  Allocation, deallocation and efficient access of memory
  Compute buffer: a special memory allocation used exclusively for parallel computation
– The Kokkos compute model addresses Difficulty #2
  Description of kernels for parallel execution on a node
  Provides stubs for common parallel work constructs:
  – Parallel for loop
  – Parallel reduction
Supporting a new platform is only a matter of implementing these models, i.e., implementing a new Node object.
Kokkos Memory Model
A generic node model must at least
– support the scenario involving distinct memory regions
– allow efficient memory access under traditional scenarios
Node provides the following memory handling routines:
  ArrayRCP<T>       Node::allocBuffer<T>(size_t sz);
  void              Node::copyToBuffer<T>(ArrayView<const T> src, ArrayRCP<T> dest);
  void              Node::copyFromBuffer<T>(ArrayRCP<const T> src, ArrayView<T> dest);
  ArrayRCP<const T> Node::viewBuffer<T>(ArrayRCP<const T> buff);
  void              Node::readyBuffer<T>(ArrayRCP<T> buff);
Kokkos Compute Model
We have to find the correct level for programming the node:
– Too low: code dot(x,y) for each node.
  Too much work to move to a new platform.
  The effort of writing dot() duplicates that of norm1().
– Too high: code dot(x,y) for all nodes.
  Can’t exploit hardware features.
  The API becomes a programming language without a compiler.
Somewhere in the middle:
– Parallel reduction is the intersection of dot() and norm1().
– Parallel for loop is the intersection of axpy() and mat-vec.
– We need a way of fusing kernels with these basic constructs.
m kernels * n nodes = m*n implementations
m kernels + 2 constructs * n nodes = m + 2*n implementations
Kokkos Compute Model
Template meta-programming is the answer.
– This is the same approach that Intel TBB takes.
Node provides generic parallel constructs:
– Node::parallel_for, Node::parallel_reduce
The user fills the holes in the generic construct.

template <class WDP>
void Node::parallel_for(int beg, int end, WDP workdata);

template <class WDP>
typename WDP::ReductionType Node::parallel_reduce(int beg, int end, WDP workdata);

template <class T>
struct AxpyOp {
  const T * x;
  T * y;
  T alpha, beta;
  void execute(int i) { y[i] = alpha*x[i] + beta*y[i]; }
};

template <class T>
struct DotOp {
  typedef T ReductionType;
  const T * x, * y;
  T generate(int i) { return x[i]*y[i]; }
  T reduce(T x, T y) { return x + y; }
};
Nodes and Kernels: How it comes together
Kokkos developer/vendor/hero develops nodes:
– TBBNode, TPINode, RoadRunnerNode, CUDANode, SerialNode, YourNodeHere
The user develops kernels for the parallel constructs.
Template meta-programming does the rest:
– TBBNode::parallel_reduce< DotOp<float> >
– CUDANode::parallel_for< AxpyOp<float> >
Composition happens at compile time:
– OpenMPNode + AxpyOp is equivalent to a hand-coded OpenMP axpy.
– We may not always be able to achieve this feat.
Kokkos Linear Algebra Library
A subpackage of Kokkos providing a set of data structures and kernels for local parallel linear algebra objects.
– Coded to the Kokkos Parallel Node API.
Tpetra (global) objects consist of a Comm and a corresponding (local) Kokkos object. Implementing a new Node ports Tpetra to a new platform without any changes to Tpetra itself.

T Tpetra::Vector<T>::dot(Tpetra::Vector<T> v) {
  T lcl = this->lclVec_->dot( v.lclVec_ );
  return comm_->reduceAll<T>(SUM, lcl);
}
Teuchos Memory Management Suite: A User Perspective
Chris Baker / ORNL
TUG 2009, November, CSRI
Teuchos Memory Management
The Teuchos utility package provides a number of memory management classes:
– RCP: reference-counted pointer
– ArrayRCP: reference-counted array
– ArrayView: encapsulates the length of and pointer to an array
– Array: dynamically sized array
Tpetra/Kokkos utilize these classes in place of raw pointers for:
– writing bug-free code
– writing simple code with simple interfaces
Teuchos::RCP
RCP is a reference-counted smart pointer.
– Provides runtime protection against null dereference.
– Provides automatic garbage collection.
– Necessary in the context of exceptions.
Semantics are those of a C pointer.
Tpetra use:
– Tracking the ownership of dynamically created objects.
– Tpetra::Map objects are always passed by RCP.
– Dynamically created objects are always encapsulated in an RCP:
  RCP<Vector> Vector::getSubView(...)
Non-persisting situations allow the more efficient Teuchos::Ptr.
Teuchos::ArrayRCP
ArrayRCP is a reference-counted smart array.
– T* does double duty in C: pointer and pointer to an array.
– RCP is for the former; ArrayRCP is for the latter.
Semantics are those of a C array/pointer:
– access operators: [] * ->
– arithmetic operators: ++ -- += -=
– all operations are bounds-checked in a debug build
– iterators are available for optimal release performance
Tpetra/Kokkos use:
– Allocated arrays are always encapsulated in an ArrayRCP before return.
– Used heavily in Kokkos for compute buffers and their views.
Example: ArrayRCP and Kokkos Buffers

  ArrayRCP<T> Node::allocBuffer<T>(size_t sz);

The use of Teuchos::ArrayRCP greatly simplifies the management of compute buffers in the Kokkos memory model. In the absence of a smart pointer, the Node would need to provide a deleteBuffer() method as well.
– It would need to be manually called by the user.
– This requires the ability to identify when the buffer can be freed.
– ArrayRCP allows the Node to register a custom, Node-appropriate deallocator and additional bookkeeping data.
Example: ArrayRCP and Kokkos Buffers

  ArrayRCP<const T> Node::viewBuffer<T>(ArrayRCP<const T> buff);

– In the absence of ArrayRCP, this method requires that the user “release” the view to enable any necessary write-back to device memory. This requires manually tracking when the view has expired. Instead, the Node can register a custom deallocator for the ArrayRCP that will perform the write-back or other necessary bookkeeping.
– This is especially helpful in the context of Tpetra. Tpetra::MultiVector::get1dView() returns a host view of class data encapsulated in an ArrayRCP with the appropriate deallocator. As a result, the Tpetra user isn’t exposed to the Kokkos Node and doesn’t have to manually release the view.
Teuchos::ArrayView
RCP is sometimes overkill; non-persisting relationships can get away with Ptr. Non-persisting relationships of array data similarly utilize the ArrayView class.
– This class basically encapsulates a pointer and a size.
– Supports a subset of C array semantics.
An optimized build results in very fast code:
– No garbage collection overhead.
– Iterators become C pointers.
Well integrated with the other classes:
– Easily returned by ArrayRCP and Array.
Teuchos::Array
Array is a replacement for std::vector. The benefit of Array is integration with the other Teuchos memory classes.

Using std::vector:
  vector<int> data(...);
  int * myalloc = NULL;
  myalloc = func2( &data[offset], size );

  int * func2(int A[], int length) {
    int sum = accumulate( A, A+length, 0 );
    return new int[sum];
  }

Using the Teuchos classes:
  Array<int> data(...);
  ArrayRCP<int> myalloc;
  myalloc = func2( data(offset,size) );

  ArrayRCP<int> func2(ArrayView<int> A) {
    int sum = accumulate( A.begin(), A.end(), 0 );
    return arcp<int>(sum);
  }
Benefits of Use
The initial release of Tpetra contained no raw pointers:
– Replaced by RCP, ArrayRCP or an appropriate iterator.
– Zero memory overhead w.r.t. Epetra.
– Almost made me a lazier developer.
Debugging abilities are excellent:
– Extends beyond normal bounds checking; can put additional constraints on memory access.
– A release build results in code that is as fast as C.
These memory utilities are unique to Trilinos:
– Research-level capability
– Production-level quality