Building the Support for Radar Processing Across Memory Hierarchies: On the Development of an Array Class with Shape using C++ Expression Templates


1 Building the Support for Radar Processing Across Memory Hierarchies:
Paper: "Building the Support for Radar Processing Across Memory Hierarchies: On the Development of an Array Class with Shape using C++ Expression Templates" by Lenore R. Mullin, Xingmin Luo and Lawrence A. Bush.

Slide 1: Our paper involves the development and testing of a new array class that interfaces with the PETE expression template library. The main issues on the way to an efficient algorithm are: non-materialization (Psi rules on indexing operations such as take, drop and reverse) and mapping to memory and processors.

Motivation: The objective of the paper, and of our research more generally, is to enable efficient, fast array computations. As a step toward this goal, we wrote a specialized multi-dimensional array class that works with PETE, the expression template engine. This was needed to give PETE N-dimensional capability, and it will also serve as a platform for integrating the Psi Calculus into PETE. As part of the paper, we tested the class and showed that its performance with PETE is comparable to that of hand-coded C for the same multi-dimensional array operations.

Our strategy of extending PETE with Psi Calculus rules significantly speeds up computations by reducing intermediate computations to index manipulations, which has the added effect of eliminating many arrays used to store intermediate results. It also enables mapping a problem to processor and memory hierarchies. PETE itself uses templated types to represent expression trees and thereby removes intermediate storage objects. Our approach combines high performance with programmability.

The ultimate objective of our research is to exploit the capabilities of C++ expression templates and the Psi Calculus to facilitate fast time-domain (TD) analysis with enhanced programmability. Expression templates enable fast computation of scalar operations by eliminating intermediate memory objects. This capability is implemented in the object-oriented C++ language, rather than C. It can be extended to array operations and, as our paper shows, to N-dimensional arrays without performance degradation. Our new class, with PETE, provides a platform on which Psi Calculus operations can be implemented.

The idea of integrating Psi rules into PETE was presented at last year's HPEC. As demonstrated in "Monolithic Compiler Experiments Using C++ Expression Templates" by Mullin, Rutledge and Bond, in last year's HPEC Workshop, Psi Calculus rules enable complementary optimizations on array operations: they showed that Psi Calculus operations such as take, drop and reverse can be implemented with expression templates. This is important to many scientific programming applications, such as DSP computations.

Strategies: Various strategies have been used in this pursuit. Our strategy is different from these, but is related to the expression template strategies used in MTL and PETE.

Our motivating application is synthetic aperture radar (SAR) and its many uses (e.g. target detection, or continuously observing phenomena such as seismic movement and ocean currents), specifically time-domain (TD) analysis, which is the most straightforward and accurate SAR analysis. TD analysis is also the most computationally intensive SAR analysis; as a result, it can only be used with SAR data of limited size, and as size and resolution requirements increase, this becomes prohibitive. Faster computational techniques can therefore facilitate this method. The next slide shows how a TD convolution computation is mapped to memory hierarchies. That process is performed by hand using Psi Calculus rules; the subsequent slide shows how the process can be mechanized with additional Psi ops and non-materializations via PETE expression templates.

2 Processor / Memory Mapping
[Slide diagram: the filter H = <15 … 10> and padded SAR data X = <1 … 9>, partitioned across two processors (vertically) with a summation over cache-sized blocks.]

This slide shows the mapping of a TD convolution to processors and memory. It reflects the application of a series of Psi Calculus rules implemented by hand, presented in "Efficient Radar Processing Via Array and Index Algebras" by Mullin, Hunt and Rosenkrantz, from the Proceedings of the 1st Workshop on Optimizations for DSP and Embedded Systems, 2003.

Benefits of MOA (A Mathematics of Arrays) and the Psi Calculus: a processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level. Compositions of monolithic operations can be re-expressed as compositions of operations on smaller data granularities that match memory hierarchy levels and avoid materialization of intermediate arrays. Algorithms can be automatically, algebraically transformed to reflect the array reshaping shown above. This approach facilitates programming, or mechanizing the reshaping for computation. For any given array expression, reduction rules from the Psi Calculus can be applied in a mechanical process; according to Prof. Mullin's "A Mathematics of Arrays" PhD thesis, this is guaranteed to produce an implementation having the least possible number of memory reads and writes.

Basically, this slide shows a two-vector problem, restructured. The restructuring requires higher dimensions. In this representation, the partitioning of the problem across parallel processors is represented vertically, and the splitting of the problem by an arbitrary cache size is represented by the summation notation. We have vectors X and H, where H is the filter <15 … 10> and X, shown as <1 … 9>, is the SAR data padded with zeros to match the filter size. The blocks on the right represent breaking the problem up in the time domain; it is essentially a shift operation.

The problem is then broken up to be computed on multiple processors. When you break a computation up over processors, you do it in a way that minimizes communication; in this case we break it up by rows, so that the upper blocks are computed on one processor and the lower blocks on another, with no interprocessor communication. The problem is further broken up to integrate a cache loop: the filter is partitioned into two shorter rows (as an example, to reflect a restricted cache size). Basically, the filter <15 … 10> is split into two rows; each row fits into the cache and is computed efficiently, with no time-wasting page faults.

The idea of our research is to mechanize the Psi rules along with fast N-dimensional array computations. This requires adding shape and Psi Calculus rules. Our array class provides part of the platform, with PETE, for mechanizing these rules. Integrating additional Psi rules into PETE, such as the rules required for the dimension lifting and partitioning shown above, is also important. The rules were demonstrated manually; mechanizing these manipulations is the key to faster computations. Our partial solution to the problem demonstrates that we suffer no performance degradation by supporting shapes in our array class.

3 A + B + C with <PETE> and Shape
As part of our research, we defined an N-dimensional array class with shape, in order to support the mechanization of operations such as those on the previous slide. The new array class extends the support for array operations in PETE. The essence of our Array class is the shape notion: the shape represents the N-dimensional array's shape, i.e. the size of each of the array's dimensions. In our class, this is passed in as an STL vector. The Array class transforms the underlying linear storage container into an N-dimensional array representation and facilitates unary and binary operations on the shaped array, integrating many PETE constructs to do so. PETE is a library that facilitates loop unrolling for computations, which greatly improves speed, as it removes most temporary arrays from the computation.

This slide gives a glimpse of the non-materialization optimizations facilitated by PETE. If we have an N-dimensional array computation A + B + C (with the result assigned to another like array), it is represented as a left-associative expression tree (an abstract syntax tree) as shown on the slide. The nodes of the tree are just templated types; they are not actually evaluated. The actual evaluation is done by the overloaded assignment operator, rather than by cascading operator overloading. The expression is represented as nested C++ template types, as shown. The primary expression, shown in white, is of type Expression and is comprised of several sub-types: it is essentially a binary node composed of an operation (addition in this case), a binary node that resolves to an array reference type, and a reference type with a templated array sub-type. The nested binary node, shown in green, represents the A + B portion of the expression tree.

The point of all this is to represent the expression as templated types that are ultimately resolved by the overloaded assignment operator. Normally, C++ resolves these operations at each step; PETE forces them to be resolved only when assigned to another array, thus eliminating intermediate storage objects.

Array class performance: We tested the performance of our class; a simplified graph of our results is shown above. The red line is the performance of hand-coded C on the above expression. The black line is the performance of normally implemented C++. The brown line is the performance of our array class, and the blue line (which almost covers the brown line) is the performance of a manual PETE implementation (without our class) on the same expression. It shows that our class performs almost as well as hand-coded C, much better than normally implemented C++, and as well as a manual PETE implementation. Our experimental results show promising performance on a C++ platform that is more programmable. The benefit here is fast computation with more programmability through object-oriented constructs. This also provides a platform for implementing Psi Calculus rules to further improve computational speed (future work): the Psi Calculus rules rewrite the AST, often reducing operations to mere index manipulations, as well as facilitating memory and processor mapping.

In summary, the system will: allow N-dimensional computations; unroll loops, reducing the number of computations performed; remove temporaries (via PETE and Psi ops); and map the problem to memory hierarchies.

const Expression<BinaryNode<OpAdd, BinaryNode<OpAdd, Reference<Array>, Reference<Array> >, Reference<Array> > > &expr = A + B + C;

