Introducing Tpetra and Kokkos Chris Baker/ORNL TUG 2009 November CSRI.


1 Introducing Tpetra and Kokkos Chris Baker/ORNL TUG 2009 November 3-5 @ CSRI

2 Introducing Tpetra and Kokkos
Managed by UT-Battelle for the U.S. Department of Energy
Tpetra provides a next-generation implementation of the Petra Object Model.
– This is a framework for distributed linear algebra objects.
– Tpetra is a successor to Epetra.
Kokkos is an API for programming to a generic parallel node.
– The Kokkos memory model allows code to be targeted to traditional (CPU) and non-traditional (accelerated) nodes.
– The Kokkos computational model provides a set of constructs for parallel computing operations.

3 Tpetra Organization
Tpetra follows the Petra Object Model currently implemented in Epetra:
– Map describes the distribution of object data across nodes.
– Teuchos::Comm abstracts internode communication.
– Import, Export, Distributor utility classes facilitate efficient data transfer.
– Operator, RowMatrix, RowGraph provide abstract interfaces.
– Vector, MultiVector, CrsGraph, CrsMatrix are concrete implementations that are the workhorses of Tpetra-centered codes.
Any class with significant data is templated.
Any class with significant computation uses Kokkos.

4 Tpetra vs. Epetra
Most of the functionality of Epetra is present in Tpetra. Some differences prohibit a “find-replace” migration:
- Minor interface changes
- Dependency on Kokkos package
- Introduction of templated classes
Epetra:
    Epetra_MpiComm comm(...);
    Epetra_Map map(numGlobal, 0, comm);
    Epetra_CrsMatrix A( Copy, map, &nnz, true );
    Epetra_Vector x(map), y(map);
    A.Apply(x,y);
Tpetra:
    RCP<...> comm = rcp(...);
    Map<...> map(numGlobal, 0, comm);
    CrsMatrix<...> A( rcpFromRef(map), nnz, StaticProfile );
    Vector<...> x( rcpFromRef(map) ), y( rcpFromRef(map) );
    A.apply(x,y);

5 Tpetra Templated Classes
A limitation of Epetra is that the implementation is tied to double and int.
– Deployment of Epetra discourages significant modifications.
– Published interface limits the possible implementation changes.
Clean slate and compiler availability allow Tpetra to address this via template parameters to classes.
This provides numerous capability extensions:
– No 4GB limit: surpassing int enables arbitrarily large problems.
– Arbitrary scalar types: float, complex, matrix, qd_real
– Greater efficiency.

6 Tpetra Basic Template Parameters
Three primary template arguments:
– LocalOrdinal, GlobalOrdinal, Scalar
Scalar enables the description of numerical objects over different fields.
– Any mathematically well-defined type is supported.
– The type must additionally be supported by Teuchos::ScalarTraits and Teuchos::SerializationTraits.
LocalOrdinal describes local element indices.
– Intended to enable efficiency; should be chosen as small as possible.
GlobalOrdinal describes global element indices.
– Intended to enable larger problem sizes.
– Decoupling from LocalOrdinal is necessary when the number of nodes is large.

7 Tpetra Template Examples
Map
    global_size_t getGlobalNumElements()
    size_t getNodeNumElements()
    LocalOrdinal getLocalElement(GlobalOrdinal gid)
    GlobalOrdinal getGlobalElement(LocalOrdinal lid)
CrsMatrix
    global_size_t getGlobalNumEntries()
    size_t getNodeNumEntries()
    void getGlobalRowView(GlobalOrdinal gid, ArrayRCP<const GlobalOrdinal> &inds, ArrayRCP<const Scalar> &vals)
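The split between local and global index types can be illustrated with a minimal sketch. `ToyMap` below is a hypothetical stand-in, not the Tpetra::Map implementation: it assumes a contiguous, near-uniform distribution and reuses only the accessor names listed on the slide. The constructor arguments (`numProcs`, `myRank`) are assumptions made to keep the example free of MPI.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical contiguous map: each rank owns one consecutive block of
// global indices. LocalOrdinal indexes entries on one rank; GlobalOrdinal
// indexes the whole distributed object.
template <class LocalOrdinal, class GlobalOrdinal>
class ToyMap {
public:
  ToyMap(GlobalOrdinal numGlobal, int numProcs, int myRank)
    : numGlobal_(numGlobal), numLocal_(0), minGid_(0) {
    const GlobalOrdinal base = numGlobal / numProcs;
    const GlobalOrdinal rem  = numGlobal % numProcs;
    const GlobalOrdinal rank = myRank;
    // The first `rem` ranks own one extra element.
    numLocal_ = static_cast<LocalOrdinal>(base + (rank < rem ? 1 : 0));
    minGid_   = rank * base + (rank < rem ? rank : rem);
  }

  GlobalOrdinal getGlobalNumElements() const { return numGlobal_; }
  std::size_t getNodeNumElements() const {
    return static_cast<std::size_t>(numLocal_);
  }

  // Returns -1 when the global index is not owned by this rank.
  LocalOrdinal getLocalElement(GlobalOrdinal gid) const {
    if (gid < minGid_ || gid >= minGid_ + numLocal_) return LocalOrdinal(-1);
    return static_cast<LocalOrdinal>(gid - minGid_);
  }

  GlobalOrdinal getGlobalElement(LocalOrdinal lid) const {
    return minGid_ + lid;
  }

private:
  GlobalOrdinal numGlobal_;
  LocalOrdinal  numLocal_;
  GlobalOrdinal minGid_;
};
```

With `ToyMap<int, long long>`, five billion global elements spread over four ranks leave each rank with a local count that still fits in `int`, while the global indices exceed it.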

8 Tpetra Advanced Template Parameters
Other template arguments exist to provide additional flexibility in Tpetra object implementation:
– Node template argument specifies a Kokkos node.
– Local data structures and implementations are also flexible.
Example: CrsMatrix<Scalar, LO, GO, Node, LclMatVec, LclMatSolve>
    Parameter    Default                        Description
    Scalar       (none)                         Field for matrix values
    LO           int                            Type of local indices
    GO           LO                             Type of global indices
    Node         Kokkos::DefaultNode            Kokkos node for local operations
    LclMatVec    Kokkos::DefaultSparseMultiply  Implementation of local sparse mat-vec
    LclMatSolve  Kokkos::DefaultSparseSolve     Implementation of local sparse solve
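The chain of defaults in that parameter list can be sketched with ordinary C++ default template arguments. Everything named `Toy...` below is hypothetical; in particular, the template arguments given to the default multiply and solve types are an assumption, since the slide lists only their names.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical placeholders for the node and local-kernel types.
struct ToyDefaultNode {};
template <class Scalar, class LO, class Node> struct ToySparseMultiply {};
template <class Scalar, class LO, class Node> struct ToySparseSolve {};

// Only Scalar is required; each later parameter defaults in terms of
// the earlier ones, mirroring the defaults listed on the slide.
template <class Scalar,
          class LO          = int,
          class GO          = LO,
          class Node        = ToyDefaultNode,
          class LclMatVec   = ToySparseMultiply<Scalar, LO, Node>,
          class LclMatSolve = ToySparseSolve<Scalar, LO, Node> >
struct ToyCrsMatrix {
  typedef Scalar    scalar_type;
  typedef LO        local_ordinal_type;
  typedef GO        global_ordinal_type;
  typedef Node      node_type;
  typedef LclMatVec mat_vec_type;
};
```

A user who only cares about the scalar type writes `ToyCrsMatrix<double>`; one who needs 64-bit global indices writes `ToyCrsMatrix<double, int, long long>` and still inherits the remaining defaults.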

9 Kokkos Parallel Node API
Want: minimize the effort needed to port Tpetra.
The goal of Kokkos is to allow code, once written, to be run on any parallel node, regardless of architecture.
Difficulties are many.
Difficulty #1: Many different memory architectures
– Node may have multiple, disjoint memory spaces.
– Optimal performance may require special memory placement.
Difficulty #2: Kernels must be tailored to architecture
– Implementation of the optimal kernel will vary between architectures.
– No universal binary, so separate compilation paths are needed.

10 Kokkos Node API
Kokkos provides two components:
– Kokkos memory model addresses Difficulty #1
  Allocation, deallocation and efficient access of memory
  Compute buffer: special memory allocation used exclusively for parallel computation
– Kokkos compute model addresses Difficulty #2
  Description of kernels for parallel execution on a node
  Provides stubs for common parallel work constructs:
  – Parallel for loop
  – Parallel reduction
Supporting a new platform is only a matter of implementing these models, i.e., implementing a new Node object.

11 Kokkos Memory Model
A generic node model must at least
– support the scenario involving distinct memory regions
– allow efficient memory access under traditional scenarios
Node provides the following memory handling routines:
    ArrayRCP<T> Node::allocBuffer<T>(size_t sz);
    void Node::copyToBuffer<T>(ArrayView<T> src, ArrayRCP<T> dest);
    void Node::copyFromBuffer<T>(ArrayRCP<T> src, ArrayView<T> dest);
    ArrayRCP<T> Node::viewBuffer<T>(ArrayRCP<T> buff);
    void Node::readyBuffer<T>(ArrayRCP<T> buff);
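For the traditional (single memory space) scenario, these routines can be sketched with the standard library alone. `ToyHostNode` is hypothetical and is not the Kokkos node implementation: `std::shared_ptr` with an array deleter stands in for ArrayRCP, `std::vector` stands in for ArrayView, and the explicit element count passed to `copyFromBuffer` is an assumption needed because `shared_ptr` does not carry a length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical host-only node: compute buffers are ordinary heap arrays.
class ToyHostNode {
public:
  ToyHostNode() : liveBuffers_(std::make_shared<int>(0)) {}

  // Allocate a zero-initialized compute buffer. The custom deleter frees
  // the array and updates the node's bookkeeping when the last reference
  // to the buffer goes away.
  template <class T>
  std::shared_ptr<T> allocBuffer(std::size_t sz) {
    std::shared_ptr<int> live = liveBuffers_;
    ++*live;
    return std::shared_ptr<T>(new T[sz](), [live](T* p) {
      delete[] p;
      --*live;
    });
  }

  // Host data -> compute buffer.
  template <class T>
  void copyToBuffer(const std::vector<T>& src, const std::shared_ptr<T>& dest) {
    std::copy(src.begin(), src.end(), dest.get());
  }

  // Compute buffer -> host data (first n entries).
  template <class T>
  void copyFromBuffer(std::size_t n, const std::shared_ptr<T>& src,
                      std::vector<T>& dest) {
    dest.assign(src.get(), src.get() + n);
  }

  int liveBuffers() const { return *liveBuffers_; }

private:
  std::shared_ptr<int> liveBuffers_;  // count of buffers not yet freed
};
```

On a node with a separate device memory, the same interface would hide device allocation and transfers behind these calls; the calling code would not change.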

12 Kokkos Compute Model
Have to find the correct level for programming the node:
– Too low: code dot(x,y) for each node.
  Too much work to move to a new platform.
  Effort of writing dot() duplicates that of norm1().
– Too high: code dot(x,y) for all nodes.
  Can’t exploit hardware features.
  API becomes a programming language without a compiler.
Somewhere in the middle:
– Parallel reduction is the intersection of dot() and norm1().
– Parallel for loop is the intersection of axpy() and mat-vec.
– We need a way of fusing kernels with these basic constructs.
Number of implementations to write:
– Kernel per node: m kernels * n nodes = m*n
– Kernels plus constructs: m kernels + 2 constructs * n nodes = m + 2*n

13 Kokkos Compute Model
Template meta-programming is the answer.
– This is the same approach that Intel TBB takes.
Node provides generic parallel constructs:
– Node::parallel_for, Node::parallel_reduce
User fills the holes in the generic construct.
    template <class WDP>
    void Node::parallel_for(int beg, int end, WDP workdata);
    template <class WDP>
    typename WDP::ReductionType Node::parallel_reduce(int beg, int end, WDP workdata);
    template <class T>
    struct AxpyOp {
      const T * x;
      T * y;
      T alpha, beta;
      void execute(int i) { y[i] = alpha*x[i] + beta*y[i]; }
    };
    template <class T>
    struct DotOp {
      typedef T ReductionType;
      const T * x, * y;
      T generate(int i) { return x[i]*y[i]; }
      T reduce(T x, T y) { return x + y; }
    };
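To make the division of labor concrete, here is a minimal sketch of a serial node that supplies the two constructs, together with the two kernels from the slide. `ToySerialNode` and the `axpy`/`dot` wrappers are hypothetical, and the `identity()` member on `DotOp` is an assumption added so that a reduction over an empty range is well defined.

```cpp
#include <cassert>
#include <vector>

// Hypothetical serial node: the "parallel" constructs are plain loops.
struct ToySerialNode {
  template <class WDP>
  static void parallel_for(int beg, int end, WDP wd) {
    for (int i = beg; i < end; ++i) wd.execute(i);
  }

  template <class WDP>
  static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
    typename WDP::ReductionType result = wd.identity();
    for (int i = beg; i < end; ++i) result = wd.reduce(result, wd.generate(i));
    return result;
  }
};

// Kernel for parallel_for: y = alpha*x + beta*y, one entry per call.
template <class T>
struct AxpyOp {
  const T* x;
  T* y;
  T alpha, beta;
  void execute(int i) { y[i] = alpha * x[i] + beta * y[i]; }
};

// Kernel for parallel_reduce: sum of x[i]*y[i].
template <class T>
struct DotOp {
  typedef T ReductionType;
  const T *x, *y;
  T identity() const { return T(0); }  // assumption: not shown on the slide
  T generate(int i) const { return x[i] * y[i]; }
  T reduce(T a, T b) const { return a + b; }
};

template <class T>
void axpy(T alpha, const std::vector<T>& x, T beta, std::vector<T>& y) {
  AxpyOp<T> op;
  op.x = x.data(); op.y = y.data(); op.alpha = alpha; op.beta = beta;
  ToySerialNode::parallel_for(0, static_cast<int>(y.size()), op);
}

template <class T>
T dot(const std::vector<T>& x, const std::vector<T>& y) {
  DotOp<T> op;
  op.x = x.data(); op.y = y.data();
  return ToySerialNode::parallel_reduce(0, static_cast<int>(x.size()), op);
}
```

The kernels never mention how the iteration range is traversed; that decision lives entirely in the node.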

14 Nodes and Kernels: How it comes together
Kokkos developer/Vendor/Hero develops nodes:
– TBBNode, TPINode, RoadRunnerNode, CUDANode, SerialNode, YourNodeHere
User develops kernels for parallel constructs.
Template meta-programming does the rest:
– TBBNode<...>::parallel_reduce
– CUDANode<...>::parallel_for
Composition is compile-time:
– OpenMPNode + AxpyOp equivalent to hand-coded OpenMP Axpy.
– May not always be able to achieve this feat.
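Compile-time composition can be sketched by running one kernel on two different node types. Both nodes below are hypothetical: `ToyBlockedNode` reduces block by block and then combines the partial results, which is the shape a threaded node would take, but it runs serially here to keep the example dependency-free. `ToyDotOp` repeats the dot-product kernel, again with an assumed `identity()` member.

```cpp
#include <cassert>
#include <vector>

template <class T>
struct ToyDotOp {
  typedef T ReductionType;
  const T *x, *y;
  T identity() const { return T(0); }  // assumption: not shown on the slides
  T generate(int i) const { return x[i] * y[i]; }
  T reduce(T a, T b) const { return a + b; }
};

// Node 1: one straight loop.
struct ToySerialNode {
  template <class WDP>
  static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
    typename WDP::ReductionType result = wd.identity();
    for (int i = beg; i < end; ++i) result = wd.reduce(result, wd.generate(i));
    return result;
  }
};

// Node 2: reduce fixed-size blocks, then combine the partial results.
struct ToyBlockedNode {
  template <class WDP>
  static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
    const int blockSize = 4;
    typename WDP::ReductionType total = wd.identity();
    for (int b = beg; b < end; b += blockSize) {
      const int bEnd = (b + blockSize < end) ? b + blockSize : end;
      typename WDP::ReductionType partial = wd.identity();
      for (int i = b; i < bEnd; ++i) partial = wd.reduce(partial, wd.generate(i));
      total = wd.reduce(total, partial);
    }
    return total;
  }
};

// The node is a template argument, so node and kernel are fused by the
// compiler rather than dispatched at run time.
template <class Node, class T>
T dotOn(const std::vector<T>& x, const std::vector<T>& y) {
  ToyDotOp<T> op;
  op.x = x.data(); op.y = y.data();
  return Node::parallel_reduce(0, static_cast<int>(x.size()), op);
}
```

Calling `dotOn<ToySerialNode>(x, y)` and `dotOn<ToyBlockedNode>(x, y)` instantiates two different loops from the same kernel source.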

15 Kokkos Linear Algebra Library
A subpackage of Kokkos providing a set of data structures and kernels for local parallel linear algebra objects.
Coded to the Kokkos Parallel Node API.
Tpetra (global) objects consist of a Comm and a corresponding (local) Kokkos object.
Implementing a new Node ports Tpetra without any changes to Tpetra.
    template <class T>
    T Tpetra::Vector<T>::dot(Tpetra::Vector<T> v) {
      T lcl = this->lclVec_->dot( v.lclVec_ );
      return comm_->reduceAll<T>(SUM, lcl);
    }
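The local-then-global structure of that dot product can be simulated in a single process. The sketch below is an illustration under stated assumptions, not Tpetra code: each inner vector plays the role of one rank's local data, `localDot` stands in for the local Kokkos kernel, and `reduceAllSum` stands in for the communicator's sum reduction. All three function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Role of the local linear algebra kernel: dot product of one rank's data.
double localDot(const std::vector<double>& x, const std::vector<double>& y) {
  return std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
}

// Stand-in for a SUM all-reduce: combines one contribution per rank.
double reduceAllSum(const std::vector<double>& perRank) {
  return std::accumulate(perRank.begin(), perRank.end(), 0.0);
}

// Global dot product = reduction over the ranks' local dot products.
double distributedDot(const std::vector<std::vector<double> >& xParts,
                      const std::vector<std::vector<double> >& yParts) {
  std::vector<double> lcl(xParts.size());
  for (std::size_t r = 0; r < xParts.size(); ++r) {
    lcl[r] = localDot(xParts[r], yParts[r]);
  }
  return reduceAllSum(lcl);
}
```

Only `localDot` would change when moving to a new node type; the reduction across ranks is untouched, which is why a new Node ports the global object without changes.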

16 Teuchos Memory Management Suite: A User Perspective Chris Baker/ORNL TUG 2009 November 3-5 @ CSRI

17 Teuchos Memory Management
The Teuchos utility package provides a number of memory management classes:
– RCP: reference counted pointer
– ArrayRCP: reference counted array
– ArrayView: encapsulates the length of and pointer to an array
– Array: dynamically sized array
Tpetra/Kokkos utilize these classes in place of raw pointers for:
– writing bug-free code
– writing simple code with simple interfaces

18 Teuchos::RCP
RCP is a reference-counted smart pointer.
– Provides runtime protection against null dereference.
– Provides automatic garbage collection.
– Necessary in the context of exceptions.
Semantics are those of a C pointer.
Tpetra use:
– Tracking the ownership of dynamically created objects.
– Tpetra::Map objects are always passed by RCP.
– Dynamically created objects are always encapsulated in an RCP:
    RCP<...> Vector::getSubView(...)
Non-persisting situations allow the more efficient Teuchos::Ptr.
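The ownership pattern described here can be sketched with `std::shared_ptr` as a stand-in for Teuchos::RCP; the two are similar in spirit, but this is not the Teuchos API. `ToyLayout` and `ToyVector` are hypothetical names playing the roles of a map and a vector that persists it.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for a distribution object shared by many vectors.
struct ToyLayout {
  int numGlobal;
  explicit ToyLayout(int n) : numGlobal(n) {}
};

// An object that keeps the layout beyond the call stores a counted
// reference, so the layout lives as long as any vector still uses it.
class ToyVector {
public:
  explicit ToyVector(const std::shared_ptr<const ToyLayout>& layout)
    : layout_(layout) {
    if (!layout_) throw std::invalid_argument("ToyVector: null layout");
  }
  std::shared_ptr<const ToyLayout> getLayout() const { return layout_; }

private:
  std::shared_ptr<const ToyLayout> layout_;
};

// Non-persisting use: the function only borrows the object for the
// duration of the call, so no reference count is touched.
int globalLength(const ToyLayout& layout) { return layout.numGlobal; }
```

The reference count rises while vectors hold the layout and falls back when they are destroyed, with no explicit delete anywhere in user code.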

19 Teuchos::ArrayRCP
ArrayRCP is a reference-counted smart array.
– T* does double duty in C: pointer and pointer to array.
– RCP is for the former; ARCP is for the latter.
Semantics are those of a C array/pointer:
– access operators: [] * ->
– arithmetic operators: + - ++ -- += -=
– all operations are bounds-checked in debug mode
– iterators are available for optimal release performance
Tpetra/Kokkos use:
– Allocated arrays are always encapsulated in an ARCP before return.
– Used heavily in Kokkos for compute buffers and their views.

20 Example: ARCP and Kokkos Buffers
    ArrayRCP<T> Node::allocBuffer<T>(size_t sz);
The use of Teuchos::ArrayRCP greatly simplifies the management of compute buffers in the Kokkos memory model.
In the absence of a smart pointer, the Node would need to provide a deleteBuffer() method as well.
– Would need to be manually called by user.
– This requires the ability to identify when the buffer can be freed.
– ArrayRCP allows Node to register a custom, Node-appropriate deallocator and additional bookkeeping data.

21 Example: ARCP and Kokkos Buffers
    ArrayRCP<T> Node::viewBuffer<T>(ArrayRCP<T> buff);
– In the absence of ArrayRCP, this method requires that the user “release” the view to enable any necessary write-back to device memory.
  This requires manually tracking when the view has expired.
  Instead, Node can register a custom deallocator for the ArrayRCP that will perform the write-back or other necessary bookkeeping.
– This is especially helpful in the context of Tpetra.
  Tpetra::MultiVector::get1dView() returns a host view of class data encapsulated in an ArrayRCP with an appropriate deallocator.
  As a result, the Tpetra user isn’t exposed to the Kokkos Node and doesn’t have to manually release the view.
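The write-back-on-release idea can be sketched with a custom deleter on `std::shared_ptr`, used here as a stand-in for ArrayRCP with a custom deallocator. `ToyDeviceBuffer` is hypothetical: its "device memory" is just a separate `std::vector`, and the host view is a staging copy that is written back when the last reference to the view is dropped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical buffer living in a memory space the host cannot touch directly.
class ToyDeviceBuffer {
public:
  explicit ToyDeviceBuffer(std::size_t n)
    : device_(std::make_shared<std::vector<double> >(n, 0.0)) {}

  // Returns a host-side staging copy. The deleter copies the data back to
  // "device" memory and frees the staging array, so the caller never has
  // to release the view explicitly.
  std::shared_ptr<double> viewBuffer() {
    std::shared_ptr<std::vector<double> > dev = device_;
    double* host = new double[dev->size()];
    std::copy(dev->begin(), dev->end(), host);
    return std::shared_ptr<double>(host, [dev](double* p) {
      std::copy(p, p + dev->size(), dev->begin());  // write-back
      delete[] p;
    });
  }

  // Inspect "device" memory directly (for demonstration only).
  double deviceValue(std::size_t i) const { return (*device_)[i]; }

private:
  std::shared_ptr<std::vector<double> > device_;
};
```

A change made through the view becomes visible in the device data only once the view goes out of scope, which is the moment the deleter runs.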

22 Teuchos::ArrayView
RCP is sometimes overkill; non-persisting relationships can get away with Ptr.
Non-persisting relationships of array data similarly utilize the ArrayView class.
– This class essentially encapsulates a pointer and a size.
– Supports a subset of C array semantics.
Optimized build results in very fast code.
– No garbage collection overhead.
– Iterators become C pointers.
Well integrated with other classes:
– Easily returned by ArrayRCP and Array.
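A pointer-plus-size view with debug-only bounds checking takes only a few lines. `ToyArrayView`, `subview`, and `sumView` below are hypothetical and cover far less than Teuchos::ArrayView; using `NDEBUG` to switch off the check is an assumption chosen to mirror the debug/optimized distinction on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical non-owning view: a pointer and a length, nothing else.
template <class T>
class ToyArrayView {
public:
  ToyArrayView(T* ptr, std::size_t size) : ptr_(ptr), size_(size) {}

  std::size_t size() const { return size_; }

  T& operator[](std::size_t i) const {
#ifndef NDEBUG
    // Checked only in debug builds; compiled out when NDEBUG is defined.
    if (i >= size_) throw std::out_of_range("ToyArrayView: index out of range");
#endif
    return ptr_[i];
  }

  // Iterators are raw pointers.
  T* begin() const { return ptr_; }
  T* end() const { return ptr_ + size_; }

private:
  T* ptr_;
  std::size_t size_;
};

// View of v[offset, offset + size); the range itself is always validated.
template <class T>
ToyArrayView<T> subview(std::vector<T>& v, std::size_t offset, std::size_t size) {
  if (offset + size > v.size()) throw std::out_of_range("subview: bad range");
  return ToyArrayView<T>(v.data() + offset, size);
}

// A function taking a view receives pointer and length together.
int sumView(ToyArrayView<int> a) {
  return std::accumulate(a.begin(), a.end(), 0);
}
```

Because the view does not own the data, writes through it land in the underlying container, and nothing is freed when the view is destroyed.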

23 Teuchos::Array
Array is a replacement for std::vector.
The benefit of Array is integration with other Teuchos memory classes.
With std::vector and raw pointers:
    vector<int> data(...);
    int * myalloc = NULL;
    myalloc = func2( &data[offset], size );
    int * func2(int A[], int length) {
      int sum = accumulate( A, A+length, 0 );
      return new int[sum];
    }
With Teuchos classes:
    Array<int> data(...);
    ArrayRCP<int> myalloc;
    myalloc = func2( data(offset,size) );
    ArrayRCP<int> func2(ArrayView<int> A) {
      int sum = accumulate( A.begin(), A.end(), 0 );
      return arcp<int>(sum);
    }

24 Benefits of use
Initial release of Tpetra contained no pointers:
– Replaced by RCP, ArrayRCP or appropriate iterator.
– Zero memory overhead w.r.t. Epetra.
– Almost made me a lazier developer.
Debugging abilities are excellent:
– Extends beyond normal bounds checking; can put additional constraints on memory access.
– Optimized build results in code that is as fast as C.
These memory utilities are unique to Trilinos.
– Research-level capability
– Production-level quality

