Building the Support for Radar Processing Across Memory Hierarchies: On the Development of an Array Class with Shape using C++ Expression Templates

Presentation transcript:

Slide 1. Paper: "Building the Support for Radar Processing Across Memory Hierarchies: On the Development of an Array Class with Shape using C++ Expression Templates."

Our paper involves the development and testing of a new class which interfaces with the PETE expression template library. The main issues on the way to an efficient algorithm are: non-materialization; Psi rules on indexing operations (i.e. take, drop, reverse); and mapping to memory and processors.

Motivation: The objective of the paper, and more generally of our research, is to enable efficient, fast array computations. In our paper we take steps toward this goal. To do this, we wrote a specialized multi-dimensional array class that works with PETE, the expression template engine. This was needed to give PETE n-dimensional capability, and it will also serve as a platform for integrating Psi into PETE. As part of the paper, we tested the class to show that its performance with PETE is comparable to that of hand-coded C for the same multi-dimensional array operations.

Our strategy of extending PETE with Psi Calculus rules significantly speeds up computations by reducing intermediate computations to index manipulations (putting it briefly), which has the added effect of eliminating many of the intermediate arrays used to store intermediate results. It also enables mapping a problem to processor and memory hierarchies. PETE itself uses templated types to represent expression trees, removing intermediate storage objects. Our approach combines high performance with programmability. The ultimate objective of our research is to exploit the capabilities of C++ expression templates and the Psi Calculus to facilitate fast TD analysis with enhanced programmability.
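The reduction of operations to index manipulations mentioned above can be illustrated with a small sketch. This is a hypothetical stand-in, not the paper's class: a `View` type (the name and fields are illustrative) expresses take, drop, and reverse as index arithmetic over the underlying storage, so no intermediate array is ever materialized.

```cpp
// Hypothetical sketch: take/drop/reverse as index remapping over a view,
// with no intermediate array materialized.  Names are illustrative only.
#include <cassert>
#include <cstddef>
#include <vector>

struct View {
    const std::vector<double>* data;  // underlying storage (never copied)
    std::size_t start;                // offset into the storage
    std::size_t len;                  // number of visible elements
    bool reversed;                    // whether indices run backward

    double operator[](std::size_t i) const {
        return (*data)[reversed ? start + (len - 1 - i) : start + i];
    }
};

View make_view(const std::vector<double>& v) {
    return View{&v, 0, v.size(), false};
}

View take(View v, std::size_t n) {        // keep the first n elements
    if (v.reversed) v.start += v.len - n; // reversed: keep the last n of storage
    v.len = n;
    return v;
}

View drop(View v, std::size_t n) {        // skip the first n elements
    if (!v.reversed) v.start += n;        // reversed: dropping trims the far end
    v.len -= n;
    return v;
}

View reverse(View v) {
    v.reversed = !v.reversed;
    return v;
}
```

Composing these operations only rewrites `(start, len, reversed)`; the data vector is untouched, which is the essence of the non-materialization optimization.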
Slide 2. Expression Templates enable fast computation of scalar operations by eliminating intermediate memory objects. This capability is implemented in the object-oriented C++ language, rather than C. It can be extended to array operations and, as our paper shows, to N-dimensional arrays without performance degradation. Our new class, together with PETE, provides a platform on which Psi Calculus operations can be implemented.

The idea of integrating Psi rules into PETE was presented last year at HPEC. As demonstrated in the paper "Monolithic Compiler Experiments Using C++ Expression Templates" by Mullin, Rutledge and Bond in last year's HPEC Workshop, Psi Calculus rules enable complementary optimizations on array operations. They showed that Psi Calculus operations such as take, drop, and reverse could be implemented with expression templates. This is important to many scientific programming applications, such as DSP computations.

Strategies: Various strategies have been used in this pursuit. Ours is different from these, but is related to the expression template strategies used in MTL and PETE.

Our motivating application is synthetic aperture radar (SAR) and its many applications (e.g. target detection, or continuously observing phenomena such as seismic movement and ocean currents), specifically time-domain (TD) analysis, which is the most straightforward and accurate SAR analysis. Time-domain analysis is also the most computationally intensive SAR analysis. As a result, it can only be used with SAR data of limited size; as size and resolution requirements increase, this becomes prohibitive. Faster computational techniques can therefore facilitate this method. The next slide shows how a TD convolution computation is mapped to memory hierarchies. This process is performed by hand using Psi Calculus rules.
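For reference, the TD convolution that drives the cost argument above can be sketched as follows. This is the textbook shift-and-accumulate formulation, not code from the paper; the names `x` (padded SAR data) and `h` (filter) follow the slides' example.

```cpp
// Textbook time-domain convolution: y[n+k] += x[n] * h[k].
// The O(|x| * |h|) inner product count is why TD analysis is the most
// computationally intensive SAR method.
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<double> td_convolve(const std::vector<double>& x,
                                const std::vector<double>& h) {
    // Full convolution: the result has |x| + |h| - 1 elements.
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size(); ++k)
            y[n + k] += x[n] * h[k];   // shift-and-accumulate
    return y;
}
```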
The subsequent slide will show how this process can be mechanized with additional Psi ops and non-materializations via PETE expression templates.

by Lenore R. Mullin, Xingmin Luo and Lawrence A. Bush

Slide 3. Processor / Memory Mapping

[Figure: a TD convolution mapped to two processors and a cache loop. The filter H = <15 14 13 12 11 10> multiplies shifted, zero-padded blocks of the data X = <1 … 9>; a summation over cache-sized blocks is shown, and the vertical split of the blocks marks the partitioning across processors (e.g. processor 1).]

This slide shows the mapping of a TD convolution to processor and memory. It reflects the application of a series of Psi Calculus rules implemented by hand, presented in the paper "Efficient Radar Processing Via Array and Index Algebras" by Mullin, Hunt and Rosenkrantz, from the Proceedings of the 1st Workshop on Optimizations for DSP and Embedded Systems, 2003.

Benefits of MOA (Mathematics of Arrays) and Psi Calculus:
- A processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level.
- Compositions of monolithic operations can be re-expressed as compositions of operations on smaller data granularities that match memory hierarchy levels and avoid materialization of intermediate arrays.
- Algorithms can be automatically, algebraically transformed to reflect the array reshaping shown above.

This approach facilitates programming, or mechanizing, the reshaping for computation. For any given array expression, reduction rules from the Psi Calculus can be applied in a mechanical process. According to Prof. Mullin's "A Mathematics of Arrays" PhD thesis, this is guaranteed to produce an implementation having the least possible number of memory reads and writes.

Basically, this slide shows a 2-vector problem, restructured. The restructuring requires higher dimensions. In this representation, the partitioning of the problem across parallel processors is represented vertically, and the splitting of the problem using an arbitrary cache size is represented by the summation notation. We have vectors X and H, where H is the filter, the vector <15 … 10>, and X, shown as <1 … 9>, is the SAR data padded with zeros to match the filter array size.
The blocks on the right represent breaking the problem up in the time domain; it is essentially a shift operation. The problem is then broken up to be computed on multiple processors. When you break up the computation over processors, you do it in a way that minimizes communication; in this case we break it up by rows. In this example, the upper blocks are computed on one processor and the lower blocks on another, with no IPC. The problem is further broken up so as to integrate a cache loop. The filter is partitioned into two shorter rows (as an example, to reflect a restricted cache size). Basically, the filter <15 … 10> is split into 2 rows; each row fits into the cache and is computed efficiently, with no time-wasting page faults.

The idea of our research is to mechanize the Psi rules along with fast N-dimensional array computations. This requires adding shape and Psi Calculus rules. Our array class provides, with PETE, part of the platform for mechanizing these rules. Integrating additional Psi rules into PETE, such as the rules required for the dimension lifting and partitioning shown above, is also important. The rules were demonstrated manually; mechanizing these manipulations is the key to faster computations. Our partial solution to the problem demonstrates that we incur no performance degradation by supporting shapes in our array class.
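The cache-loop transformation described above can be sketched in code. This is an illustrative blocked convolution, not the paper's mechanized Psi-rule derivation: the filter `h` is split into cache-sized blocks (the extra loop level corresponds to the lifted dimension on the slide), and each block's partial convolution is accumulated into the shared output.

```cpp
// Illustrative cache-blocked TD convolution: the filter is processed in
// blocks of `block` taps, modeling the slide's dimension lifting.  Each
// block's partial sums accumulate into y, so the result matches the
// unblocked convolution exactly.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<double> blocked_convolve(const std::vector<double>& x,
                                     const std::vector<double>& h,
                                     std::size_t block) {  // block ~ cache capacity
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t b = 0; b < h.size(); b += block) {    // one "row" of the lifted filter
        std::size_t end = std::min(b + block, h.size());
        for (std::size_t n = 0; n < x.size(); ++n)
            for (std::size_t k = b; k < end; ++k)
                y[n + k] += x[n] * h[k];                   // partial sum for this block
    }
    return y;
}
```

The outer block loop could likewise be distributed over processors, since the blocks write disjoint partial sums per row and need no inter-processor communication until the final accumulation.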

Slide 4. <PETE> Shape

[Figure: expression tree for A + B + C, with nodes A, B, C and + operators.]

As part of our research, we defined an N-dimensional array class with shape, in order to support mechanization operations such as those on the previous slide. The new array class extends the support for array operations in PETE. The essence of our Array class is the shape notion, which represents the N-dimensional array shape: the size of each of the array's dimensions. In our class, this is passed in as an STL vector. The Array class transforms the underlying linear storage container into an N-dimensional array representation and facilitates unary and binary operations on the shaped array, integrating many PETE constructs to do so.

PETE is a library that facilitates loop unrolling for computations, which greatly improves speed, as it removes most temporary arrays from the computation. This slide gives you a glimpse of the non-materialization optimizations facilitated by PETE. If we have an N-dimensional array computation A+B+C (the result assigned to another like array), it is represented as a left-associative expression tree (or abstract syntax tree) such as this (point to slide). The nodes of the tree are just templated types; they are not actually evaluated. The actual evaluation is done by the overloaded assignment operator rather than by cascading operator overloading.

The expression is represented as embedded C++ template types, as shown (point to slide). The primary expression, shown in white, is of type Expression and is comprised of many sub-types. It is essentially a binary node composed of an operation (addition in this case), a reference type with a templated array sub-type, and another binary node which resolves to an array reference type. Thus, it represents the addition of two arrays. The second sub-expression, represented in green, represents the A+B portion of the expression tree.
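The shape notion described above can be sketched as follows. This is a minimal, hypothetical stand-in for the paper's Array class (the name `ShapedArray` and its interface are illustrative): the shape arrives as an STL vector, and the class maps N-dimensional indices onto flat row-major storage.

```cpp
// Minimal sketch of an N-dimensional array with shape: the shape (one
// extent per dimension) is an STL vector, and N-dimensional indices are
// linearized row-major onto flat storage.  The paper's class additionally
// plugs into PETE's expression machinery; this sketch does not.
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

class ShapedArray {
public:
    explicit ShapedArray(const std::vector<std::size_t>& shape)
        : shape_(shape),
          data_(std::accumulate(shape.begin(), shape.end(), std::size_t(1),
                                std::multiplies<std::size_t>())) {}

    // Row-major linearization of an N-dimensional index.
    double& at(const std::vector<std::size_t>& idx) {
        std::size_t offset = 0;
        for (std::size_t d = 0; d < shape_.size(); ++d)
            offset = offset * shape_[d] + idx[d];
        return data_[offset];
    }

    const std::vector<std::size_t>& shape() const { return shape_; }

private:
    std::vector<std::size_t> shape_;  // extent of each dimension
    std::vector<double> data_;        // flat underlying storage
};
```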
The point of all this is to represent the expression as templated types which are ultimately resolved by the overloaded assignment operator. Normally, C++ resolves these operations at each step; PETE, however, forces them to be resolved only when assigned to another array, thus eliminating intermediate storage objects.

Array class: As I said before, our Array class uses these constructs and facilitates these operations on N-dimensional arrays. We tested the performance of our class; a simplified graph of our results is shown above. Essentially, the red line is the performance of hand-coded C on the above expression; the black line is the performance of C++ (done normally); the brown line is the performance of our array class; and the blue line (which almost covers the brown line) is the performance of PETE with a manual implementation (without the use of our class). It shows that our class performs almost as well as hand-coded C, much better than normally implemented C++, and as well as a manual PETE implementation.

Our experimental results show promising performance on a C++ platform that is more programmable. The benefit here is fast computation with more programmability through object-oriented constructs. This also provides a platform for implementing Psi Calculus rules to further improve computational speed (to be done in the future). The Psi Calculus rules rewrite the AST, often reducing operations to mere index manipulations, as well as facilitating memory and processor mapping.

In summary, the system will: allow N-dimensional computations; unroll loops, reducing the number of computations performed; remove temporaries (via PETE and Psi ops); and map the problem to memory hierarchies.

    const Expression<BinaryNode<OpAdd, Reference<Array>,
                     BinaryNode<OpAdd, Reference<Array>,
                                Reference<Array> > > > &expr = A + B + C;
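The deferred-evaluation mechanism can be illustrated with a stripped-down sketch. This is a simplified stand-in, not PETE's actual implementation: `AddNode` plays the role of `BinaryNode<OpAdd, …>`, `a + b + c` builds a tree of templated node types without doing any arithmetic, and the overloaded assignment operator evaluates the whole tree element by element, with no temporary arrays.

```cpp
// Simplified expression-template sketch (not PETE itself): operator+
// builds a tree of templated nodes; the overloaded assignment walks the
// tree once per element, so no intermediate array is ever allocated.
#include <cassert>
#include <cstddef>
#include <vector>

template <typename L, typename R>
struct AddNode {                        // analogous to BinaryNode<OpAdd, L, R>
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Evaluation happens here, element by element, with no temporaries.
    template <typename Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <typename L, typename R>
AddNode<L, R> operator+(const L& l, const R& r) { return AddNode<L, R>{l, r}; }
```

Here `a + b + c` has type `AddNode<AddNode<Vec, Vec>, Vec>`, the simplified analogue of the nested `BinaryNode` type shown above; the nodes hold only references, and the single evaluation loop runs inside `operator=`.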