1
Building the Support for Radar Processing Across Memory Hierarchies: On the Development of an Array Class with Shape using C++ Expression Templates
Paper: Building the Support for Radar Processing Across Memory Hierarchies: On the Development of an Array Class with Shape using C++ Expression Templates

Our paper, "Building Support for Radar Processing Across Memory Hierarchies," describes the development and testing of a new array class that interfaces with the PETE expression template library. The ultimate objective of our research is to exploit the capabilities of expression templates and the Psi Calculus. Expression templates enable fast computation of scalar operations by eliminating intermediate memory objects. This capability is implemented in object-oriented C++ rather than C, and, as our paper shows, it can be extended to N-dimensional array operations without performance degradation. Our new class, together with PETE, provides a platform on which Psi Calculus operations can be implemented. As demonstrated in "Monolithic Compiler Experiments Using C++ Expression Templates" by Mullin, Rutledge, and Bond at last year's HPEC Workshop, Psi Calculus rules enable complementary optimizations on array operations.

Motivation: Our motivating application is synthetic aperture radar (SAR) and its many uses (e.g., target detection, or continuous observation of phenomena such as seismic movement and ocean currents); specifically time-domain (TD) analysis, which is the most straightforward and accurate SAR analysis. Time-domain analysis is also the most computationally intensive SAR analysis, so it can only be used with SAR data of limited size. As size and resolution requirements increase, this becomes prohibitive; faster computational techniques can therefore make the method practical. Our approach combines high performance with programmability.

Related algorithms: TD convolution. Mention mapping radar algorithms to memory hierarchies.
(The ODES paper/talk explains this mapping using hand-designed and hand-derived convolution algorithms based on the psi calculus.) The idea of this talk is to mechanize that process by building MOA and the psi calculus into PETE. This requires adding shape understanding so that PETE creates an AST which the psi calculus rules can rewrite. This idea was presented last year at HPEC. In this talk we restructure the two vector algorithms for convolution and lift them to higher dimensions. With support for shapes in PETE we will be able to mechanize everything. Although this is not yet done, we demonstrate that supporting shapes and an array class introduces no performance degradation. The + example is sufficient, since adding the other operators will not change performance.

These things must be on the slides, with a story as above: radar, memory hierarchies. Define an N-dimensional array class with shape in order to support the mechanization of linear transformations in the Psi Calculus. The new array class extends PETE's support for array operations by defining a shape for the array class. We ran the experiment on two different platforms and got similar results: with PETE and our array class, we achieved performance similar to that obtained with C. Future work may add further algorithm methods to enable other psi calculus operations.

Application: RADAR, SAR. Start with Kong's paper and argument. Point out the main issues in reaching an efficient algorithm: non-materialization; psi rules on the indexing operations (e.g., take, drop, reverse); mapping to memory/processors. The objective of the paper, and more generally of our project, is to enable efficient, fast array computations. Applications: this is important to many scientific programming applications, such as DSP computations. Strategies: various strategies have been used in this pursuit, for example 1, 2, 3.
Our strategy differs from these, but is related to the expression template strategies used in MTL and PETE.

Our strategy: use PETE (the Portable Expression Template Engine). PETE is a library that facilitates loop fusion for computations, which greatly speeds them up by removing most temporary arrays from the computation. Integrate Psi with PETE: in our paper, we take steps toward this goal. To do so, we wrote a specialized multi-dimensional array class that works with PETE; this was needed to give PETE N-dimensional capability, and it will also serve as a platform for integrating Psi into PETE. As part of our paper, we tested the class to show that its performance with PETE is comparable to hand-coded C for the same multi-dimensional array operations. Our strategy is to extend PETE to use psi calculus rules. Psi calculus rules can significantly speed up computations because they reduce intermediate computations to index manipulations (putting it briefly), which has the added effect of eliminating many arrays used to store intermediate results (e.g., TD convolution and the MOA design).

Paper summary: Why do we need shape.h? So that we can handle multi-dimensional arrays in PETE. We also want our array class to be extendable to use the psi calculus.

An example of TD convolution: TD convolution is used in radar DSP. It can be used in various configurations; in other words, for different purposes or strategies it can be combined with other methods to glean the desired, or cleanest, information from the signal. For example: a frequency-domain deconvolution approach for transmitter noise cancellation is being developed. The time-domain radar return from distributed clutter is the convolution of the coded transmit pulse and the distributed clutter field.
By taking the Fast Fourier Transform (FFT) of the distributed clutter return, the IPN contribution of the noisy transmit waveform can be removed by dividing by the frequency spectrum of the measured transmit waveform; an IFFT then returns to the time domain for subsequent MTI processing. In short, one method to remove clutter takes the TD convolution of the coded transmit pulse and the distributed clutter field, FFTs it, removes certain noise by dividing by the frequency spectrum of the waveform, and applies an IFFT to revert to the time domain. (From "Adaptive Distributed Clutter Improvement Factor (ADDCIF)" by John Hoffman, Louis Vasquez, Charles Farthing, and Clarence Ng, Systems Engineering Group, Inc.)

By Lenore R. Mullin, Xingmin Luo and Lawrence A. Bush
2
Processor / Memory Mapping
[Slide figure: the TD convolution of filter H and data X written as a sum of block computations (Σ from i = 0 to (T/H) - 1 over cache-sized blocks), with the processor split shown vertically; references HPEC '02 and ODES '02.]

ODES: TD convolution. The processor split is represented vertically, and the splitting of the problem for an arbitrary cache size is represented by the sum/matrix notation. In this talk we restructure the two vector algorithms for convolution and lift them to higher dimensions.

Benefits of MOA (Mathematics of Arrays) and the Psi Calculus: a processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level. Compositions of monolithic operations can be re-expressed as compositions of operations on smaller data granularities that match memory hierarchy levels and avoid materialization of intermediate arrays. Algorithms can be automatically, algebraically transformed to reflect the array reshaping shown above. This facilitates programming expressed at a high level, intentional program design and analysis, and portability and scalability. The approach is applicable to many problems in radar. For any given array expression, reduction rules from the Psi Calculus can be applied in a mechanical process guaranteed to produce an implementation having the least possible number of memory reads and writes (L. Mullin, "A Mathematics of Arrays," PhD thesis, 1988).

This slide depicts dimension lifting of a TD convolution computation using psi calculus rules. We have vectors X and H, where H is the filter, the vector <15 … 10>, and X, shown as <1 … 9>, is the SAR data padded with zeros to match the filter size. The blocks on the right represent breaking the problem up in the time domain; it is essentially a shift operation. The problem is then broken up to be computed on multiple processors, in this example. When you break the computation up over processors, you do it in a way that minimizes communication; in this case we break it up by rows.
In this example, the problem is broken up so that the upper blocks are computed on one processor and the lower blocks on another, with no interprocessor communication. The problem is further broken up to integrate a cache loop: the filter is partitioned into two shorter rows (as an example, to reflect a restricted cache size). Basically, the filter <15 … 10> is split into two rows; each row fits into the cache and is computed efficiently, with no time wasted on page faults.

The idea of our research is to mechanize the psi rules along with fast N-dimensional array computations. The idea of integrating psi rules into PETE was presented last year at HPEC: HPEC '02 showed that psi calculus operations (take, drop, reverse) could be implemented with expression templates, and this was integrated into PETE. Integrating additional psi rules into PETE, such as those required for the dimension lifting and partitioning shown above, is also important. The rules were demonstrated manually; mechanizing these manipulations is the key to faster computations. Our array class provides part of the platform, with PETE, for mechanizing these rules. Additionally, our Array class facilitates computations on N-dimensional arrays. Our experimental results show promising performance on a C++ platform that is more programmable.
3
Shape (our contribution)
[Slide figure: expression tree for A + B + C, with the corresponding template type:

const Expression<BinaryNode<OpAdd, Reference<Array>, BinaryNode<OpAdd, Reference<Array>, Reference<Array> > > > &expr = A + B + C]

The essence of our Array class is the shape notion, which represents the N-dimensional array shape; in our class it is passed in as an STL vector. The Array class transforms the underlying linear storage container into an N-dimensional array representation and facilitates unary and binary operations on the shaped array. This slide gives a glimpse of the non-materialization optimizations facilitated by PETE; the Array class integrates many PETE constructs to support them. If we have an N-dimensional array computation A+B+C (with the result assigned to another like array), it is represented as an expression tree such as this (point to slide), which is in turn represented as nested C++ template types, as shown. The primary expression, shown in white, is of type Expression and is composed of several sub-types: it is essentially a BinaryNode comprising an operation (addition in this case), a Reference type with a templated Array sub-type, and another BinaryNode that resolves to an array Reference type. Thus it represents the addition of two arrays. The second sub-expression, shown in green, represents the A+B portion of the expression tree. The point of all this is to represent the expression as templated types that are ultimately resolved by the overloaded assignment operator. Normally, C++ resolves these operations at each step; PETE forces them to be resolved only when assigned to another array, eliminating the need for intermediate storage objects.

Array class: as stated before, our Array class uses these constructs and facilitates these operations on N-dimensional arrays. We tested the performance of our class; a simplified graph of the results is shown above. Essentially, the red line is the performance of hand-coded C on the above expression.
The black line is the performance of ordinary C++ on the above expression. The brown line is the performance of our array class, and the blue line (which almost covers the brown line) is the performance of a manual PETE implementation (without our class) on the same expression. It shows that our class performs almost as well as hand-coded C, much better than ordinary C++, and as well as a manual PETE implementation. The benefit is fast computation with greater programmability through object-oriented constructs. This also provides a platform for implementing psi calculus rules, through other MOA operations as well as memory and processor mapping, to further improve computational speed.

Show the templated type for the A = B + C + D expression and how it is evaluated (as a tree). Explain that the nodes are just templated types and are not actually evaluated; evaluation is done by the overloaded assignment operator rather than by cascading operator overloading. With support for shapes in PETE we will be able to mechanize everything. Although this is not yet done, we demonstrate that supporting shapes and an array class introduces no performance degradation. The + example is sufficient, since adding the other operators will not change performance. These things must be on the slides, with a story as above: radar, memory hierarchies. Define an N-dimensional array class with shape in order to support the mechanization of linear transformations in the Psi-Calculus. The new array class extends PETE's support for array operations by defining a shape for the array class.
We ran the experiment on two different platforms and got similar results: with PETE and our array class, we achieved performance similar to that obtained with C. Future work may add further algorithm methods to enable other psi calculus operations. Application: RADAR, SAR.

Future work, related notes: The last step in making Vec3 PETE-compatible is to provide a way for PETE to assign to a Vec3 from an arbitrary expression. This is done by overloading operator= to take a PETE expression as input and copy values into its owner:

064 template<class RHS>
065 Vec3 &operator=(const Expression<RHS> &rhs)
066 {
067   d[0] = forEach(rhs, EvalLeaf1(0), OpCombine());
068   d[1] = forEach(rhs, EvalLeaf1(1), OpCombine());
069   d[2] = forEach(rhs, EvalLeaf1(2), OpCombine());
      return *this;
072 }

The first thing to notice about this method is that it is templated on an arbitrary class RHS, but its single formal parameter has type Expression<RHS>. This combination means the compiler can match it against anything wrapped in the generic PETE template Expression<>, and only against things wrapped that way; it cannot match against int, complex<short>, or GreatAuntJane_t, since these do not have the form Expression<RHS> for some type RHS. The forEach function is used to traverse expression trees: the first argument is the expression, the second is the leaf tag denoting the operation applied at the leaves, and the third is a combiner tag used to combine results at non-leaf nodes. By passing EvalLeaf1(0) in line 67, we indicate that we want the Vec3s at the leaves to return the element at index 0. The LeafFunctor<Scalar<T>, EvalLeaf1> (defined inside PETE) ensures that scalars return their value regardless of the index. While EvalLeaf1 obtains values from the leaves, OpCombine takes these values and combines them according to the operators present at the non-leaf nodes.
The result is that line 67 evaluates the expression on the right side of the assignment operator at index 0, line 68 does so at index 1, and so on. Once evaluation is complete, operator= returns the Vec3 to which values have been assigned, in keeping with normal C++ conventions.

Psi complements this (e.g., reverse), and Psi will improve it by removing temporaries. How to implement: (1) an iterator-like concept; (2) index composition using expression templates, i.e., a special scalar-like type that is copied by value but computed only once per array operation regardless of the matrix size. This removes temporaries that PETE cannot: an operation such as reverse performs no intermediate computation on the array itself, eliminating the entire loop rather than merely reducing it to one (or the minimum number) and turning it into a constant-time indexing operation instead of a loop calculation.