lrm-1 lrm 11/15/2015 University at Albany, SUNY
Efficient Radar Processing Via Array and Index Algebras
Lenore R. Mullin, Daniel J. Rosenkrantz, Harry B. Hunt III, and Xingmin Luo
University at Albany, SUNY. NSF CCR

University at Albany, SUNY lrm-2 lrm 11/15/2015
Outline
Overview
–Motivation
 –Radar Software Processing: to exceed 1 x ops/second
 –The Mapping Problem: Efficient Use of the Memory Hierarchy; Portable, Scalable, …
 –Radar Uses Linear and Multi-linear Operators: Array-Based Operations
 –Array Operations Require an Array Algebra and an Index Calculus
Array Algebra: MoA, and Index Calculus: Psi Calculus
–Reshape to Use the Processor/Memory Hierarchy Efficiently: Lift Dimension
–High-Level Monolithic Operations: Remove Temporaries
Time Domain Convolution
Benefits of Using MoA and Psi Calculus

University at Albany, SUNY lrm-3 lrm 11/15/2015
Levels of Processor/Memory Hierarchy Can Be Modeled by Increasing Dimensionality of the Data Array.
–Additional dimension for each level of the hierarchy.
–Envision data as reshaped to reflect the increased dimensionality.
–The calculus automatically transforms the algorithm to reflect the reshaped data array.
–Data layout, data movement, and scalarization are automatically generated based on the reshaped data array.

University at Albany, SUNY lrm-4 lrm 11/15/2015
Levels of Processor/Memory Hierarchy, continued
Approach: Mathematics of Arrays
–Math and indexing operations in the same expression
–Framework for design space search
 –Rigorous and provably correct
 –Extensible to complex architectures
Example: "raising" array dimensionality, y = conv(h, x)
(Diagram: x is mapped across the memory hierarchy, main memory to L2 cache to L1 cache, and across processors P0, P1, P2 for parallelism.)

University at Albany, SUNY lrm-5 lrm 11/15/2015
Application Domain: Signal Processing, 3-d Radar Data Processing
Composition of Monolithic Array Operations
–The algorithm is input.
–Architectural information is input (hardware info: memory, processor).
–Change the algorithm to better match hardware/memory/communication: lift dimension algebraically.
Processing chain: Pulse Compression, Doppler Filtering, Beamforming, Detection (underlying operations: Convolution, Matrix Multiply).
Model processors (dim = dim+1); model time-variance (dim = dim+1); model Level 1 cache (dim = dim+1); model all three: dim = dim+3.

University at Albany, SUNY lrm-6 lrm 11/15/2015
Current Abstraction Approaches
Even when operations compose algebraically, implementations do not compose them: X(YZ) is evaluated through temporary arrays.
–Classical compiler technology and optimization: loop transformation theories, grammar changes, compiler AST optimizations, standard compiler optimizations.
–Fine-tuned high-performance libraries: BLAS, Linpack, LAPACK, ScaLAPACK, ATLAS.
–Scalable/portable libraries: PVL, Blitz++, MTL; PETE AST preprocessor (compiled).
–Some modern programming languages with monolithic arrays: Fortran 95, ZPL, MATLAB (interpreted); C++ with classes, functions, templates.
–These approaches offer partial algebras and require highly skilled programmers.

University at Albany, SUNY lrm-7 lrm 11/15/2015
Outline
Overview
Array Algebra: MoA, and Index Calculus: Psi Calculus
–Reshape to Use the Processor/Memory Hierarchy Efficiently: Lift Dimension
–High-Level Monolithic Operations: Remove Temporaries
Time Domain Convolution
Benefits of Using MoA and Psi Calculus

University at Albany, SUNY lrm-8 lrm 11/15/2015
Psi Calculus Basic Properties
–Index calculus: centers around the psi function.
–Shape-polymorphic functions and operators: operations are defined using shapes and psi.
–The fundamental type is the array, modeled as (shape_vector, components); scalars are 0-dimensional arrays, that is, (empty_vector, scalar value).
–Denotational Normal Form (DNF) = reduced form in Cartesian coordinates (independent of data layout: row major, column major, regular sparse, …).
–Operational Normal Form (ONF) = reduced form for 1-d memory layout(s).

University at Albany, SUNY lrm-9 lrm 11/15/2015
Psi Reduction
–The ONF has the minimum number of reads/writes.
–Psi Calculus rules are applied mechanically to produce the ONF, which is easily translated to an optimal loop implementation.
Example: A = cat(rev(B), rev(C)) reduces, by psi reduction, to
 A[i] = B[B.size-1-i]          if 0 ≤ i < B.size
 A[i] = C[C.size+B.size-1-i]   if B.size ≤ i < B.size+C.size
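A minimal C++ sketch of this reduced form is given below; it is illustrative only (the function name and use of std::vector are assumptions, not the authors' generated code). It shows that the ONF is just two index-shuffling loops, with no reversed or concatenated temporaries ever materialized.

```cpp
#include <cstddef>
#include <vector>

// Illustrative ONF for A = cat(rev(B), rev(C)): two loops, no temporaries.
std::vector<double> cat_rev_rev(const std::vector<double>& B,
                                const std::vector<double>& C) {
    std::vector<double> A(B.size() + C.size());
    for (std::size_t i = 0; i < B.size(); ++i)            // 0 <= i < B.size
        A[i] = B[B.size() - 1 - i];
    for (std::size_t i = B.size(); i < A.size(); ++i)     // B.size <= i < B.size + C.size
        A[i] = C[C.size() + B.size() - 1 - i];
    return A;
}
```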

University at Albany,SUNY lrm-10 lrm 11/15/2015 Some Psi Calculus Operations

University at Albany, SUNY lrm-11 lrm 11/15/2015
Convolution: Psi Calculus Description
Psi Calculus operators compose to form higher-level operations.
Definition of y = conv(h, x):
 y[n] = Σ_{k=0}^{M-1} h[k] · x'[n+M-1-k],  for 0 ≤ n < N+M-1,
where x has N elements, h has M elements, and x' is x padded by M-1 zeros on either end.
Algorithm and Psi Calculus description (algorithm step: Psi Calculus expression):
–Initial step: x and h are the inputs.
–Form x': x' = cat(reshape(<M-1>, <0>), cat(x, reshape(<M-1>, <0>)))
–Rotate x' (N+M-1) times: x'_rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x')
–Take the size-of-h part of x'_rot: x'_final = binaryOmega(take, 0, reshape(…), 1, x'_rot)
–Multiply: Prod = binaryOmega(*, 1, h, 1, x'_final)
–Sum: Y = unaryOmega(sum, 1, Prod)
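For concreteness, a plain C++ reference sketch of this definition follows. It is not the psi-calculus/expression-template implementation; the function name, use of std::vector, and the assumption M ≥ 1 are mine.

```cpp
#include <cstddef>
#include <vector>

// Reference convolution: y[n] = sum_k h[k] * x'[n+M-1-k],
// where x' is x padded with M-1 zeros on either end.
std::vector<double> conv(const std::vector<double>& h,
                         const std::vector<double>& x) {
    const std::size_t M = h.size(), N = x.size();
    std::vector<double> xp(N + 2 * (M - 1), 0.0);       // x'
    for (std::size_t i = 0; i < N; ++i) xp[M - 1 + i] = x[i];

    std::vector<double> y(N + M - 1, 0.0);
    for (std::size_t n = 0; n < y.size(); ++n)
        for (std::size_t k = 0; k < M; ++k)
            y[n] += h[k] * xp[n + M - 1 - k];
    return y;
}
```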

University at Albany, SUNY lrm-12 lrm 11/15/2015
Experimental Platform and Method
Hardware: DY4 CHAMP-AV board
–Contains 4 MPC7400s and 1 MPC8420.
–MPC7400 (G4): 450 MHz, 32 KB L1 data cache, 2 MB L2 cache, 64 MB memory per processor.
Software
–VxWorks 5.2 (real-time OS).
–GCC (non-official release): GCC with patches for VxWorks; optimization flags: -O3 -funroll-loops -fstrict-aliasing.
Method
–Run many iterations; report average, minimum, and maximum times (from 10,000,000 iterations for small data sizes down to 1,000 for large data sizes).
–All approaches run on the same data; only average times are shown here; only one G4 processor is used.
–Use of the VxWorks OS resulted in very low variability in timing, giving a high degree of confidence in the results.

University at Albany, SUNY lrm-13 lrm 11/15/2015
Experiment: Conv(x, h)
–The cost of temporaries in the regular C++ approach is more pronounced due to the large number of operations.
–The cost of expression tree manipulation is also more pronounced.

University at Albany, SUNY lrm-14 lrm 11/15/2015
Convolution and Dimension Lifting
Model the processor and the Level 1 cache.
–Start with 1-d inputs (the input dimension).
–Envision a 2nd dimension ranging over output values.
–Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.
–Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
–Psi-reduce to normal form.

University at Albany, SUNY lrm-15 lrm 11/15/2015
Envision a 2nd dimension ranging over output values. Let tz = N+M-1, where M is the number of elements of h (4 in the slide's example) and N is the number of elements of x.
(Diagram: each of the tz rows pairs the filter taps h3, h2, h1, h0 against a length-M window of the zero-padded x'; the first row shows 0, 0, 0, x0.)

University at Albany, SUNY lrm-16 lrm 11/15/2015
Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned. Let p = the number of processors.
(Diagram: the tz output rows are split into p blocks of tz/p rows each.)

University at Albany, SUNY lrm-17 lrm 11/15/2015
Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
(Diagram: each length-M row is split into blocks of the cache size.)

University at Albany, SUNY lrm-18 lrm 11/15/2015
ONF for the Convolution Decomposition with Processors & Cache
Generic form: 4-dimensional, after psi reduction. Time domain. Let tz = N+M-1, M = the size of h, N = the size of x. sum is calculated for each element of y.
1. For i0 = 0 to p-1 do:                      (processor loop)
2.   For i1 = 0 to tz/p - 1 do:               (time loop)
3.     sum ← 0
4.     For i_cacherow = 0 to M/cache - 1 do:  (cache loop)
5.       For i3 = 0 to cache - 1 do:
6.         sum ← sum + h[(M - (i_cacherow × cache + i3)) - 1] × x'[((tz/p × i0) + i1) + (i_cacherow × cache) + i3]
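Rendered as plain C++, a hedged sketch of this generic loop nest follows (not generated code; the parameter names, the assumption that p divides tz and cache divides M, and running the processor loop serially are mine).

```cpp
#include <cstddef>
#include <vector>

// Sketch of the 4-dimensional ONF loop nest for the convolution decomposition.
// p = number of processors (the i0 loop would run in parallel across them),
// cache = L1 blocking factor.
void conv_onf(const std::vector<double>& h,     // M elements
              const std::vector<double>& xp,    // x': x padded by M-1 zeros on either end
              std::vector<double>& y,           // tz = N+M-1 elements
              std::size_t p, std::size_t cache) {
    const std::size_t M = h.size(), tz = y.size();
    for (std::size_t i0 = 0; i0 < p; ++i0)                  // processor loop
        for (std::size_t i1 = 0; i1 < tz / p; ++i1) {       // time loop
            double sum = 0.0;
            for (std::size_t ic = 0; ic < M / cache; ++ic)       // cache-row loop
                for (std::size_t i3 = 0; i3 < cache; ++i3)       // within a cache row
                    sum += h[(M - (ic * cache + i3)) - 1]
                         * xp[((tz / p) * i0 + i1) + ic * cache + i3];
            y[(tz / p) * i0 + i1] = sum;                     // one output element per (i0, i1)
        }
}
```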

University at Albany, SUNY lrm-19 lrm 11/15/2015
Outline
Overview
Array Algebra: MoA, and Index Calculus: Psi Calculus
Time Domain Convolution
Other Algorithms in Radar
–Modified Gram-Schmidt QR Decomposition: MoA to ONF, experiments
–Composition of Matrix Multiplication in Beamforming: MoA to DNF, experiments
–FFT
Benefits of Using MoA and Psi Calculus

University at Albany, SUNY lrm-20 lrm 11/15/2015
Algorithms in Radar (ONF for 1 processor)
–Time Domain Convolution(x, y)
–Modified Gram-Schmidt QR(A)
–A x (B^H x C) Beamforming
Flow shown on the slide: manual description and derivation for 1 processor; DNF; lift dimension (processor, L1 cache) and reformulate; DNF to ONF; ONF; mechanize using expression templates; implement DNF/ONF in Fortran 90; use to reason about RAW; benchmark at NCSA with LAPACK; compiler optimizations; thoughts on an abstract machine; MoA & ψ Calculus.

University at Albany, SUNY lrm-21 lrm 11/15/2015
ONF for the QR Decomposition with Processors & Cache: Modified Gram-Schmidt
Loop structure shown on the slide: main loop; processor loop; processor cache loops.
Algorithm phases: initialization, compute norm, normalize, dot product, orthogonalize.
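For reference, a hedged single-processor, unblocked Modified Gram-Schmidt sketch in C++ follows, showing the phases named above. This is not the slide's ONF, which adds the processor and cache loop dimensions; the flat row-major storage and names are assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Unblocked Modified Gram-Schmidt QR. A is n x n, row-major in a flat vector;
// Q overwrites A, and R (n x n) receives the upper-triangular factor.
void mgs_qr(std::vector<double>& A, std::vector<double>& R, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        double norm = 0.0;                              // compute norm of column k
        for (std::size_t i = 0; i < n; ++i) norm += A[i*n + k] * A[i*n + k];
        R[k*n + k] = std::sqrt(norm);
        for (std::size_t i = 0; i < n; ++i)             // normalize column k
            A[i*n + k] /= R[k*n + k];
        for (std::size_t j = k + 1; j < n; ++j) {       // orthogonalize remaining columns
            double dot = 0.0;                           // dot product
            for (std::size_t i = 0; i < n; ++i) dot += A[i*n + k] * A[i*n + j];
            R[k*n + j] = dot;
            for (std::size_t i = 0; i < n; ++i) A[i*n + j] -= dot * A[i*n + k];
        }
    }
}
```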

University at Albany, SUNY lrm-22 lrm 11/15/2015
DNF for the Composition A x (B^H x C): Beamforming
Generic form: 4-dimensional. Given A, B, X, Z: n-by-n arrays.
1. Z = 0
2. For i = 0 to n-1 do:
3.   For j = 0 to n-1 do:
4.     For k = 0 to n-1 do:
5.       Z[k;] ← Z[k;] + A[k;j] × X[j;i] × B[i;]
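A hedged C++ expansion of this loop nest follows. The row update Z[k;] is written as an explicit loop over row elements; the flat row-major storage and names are assumptions. The point is that no intermediate n-by-n product is formed.

```cpp
#include <cstddef>
#include <vector>

// DNF-style loop nest for the composed matrix products: for each (i, j, k),
// the row Z[k;] accumulates A[k;j] * X[j;i] * B[i;]. All matrices are n x n,
// row-major in flat vectors.
void beamform_dnf(const std::vector<double>& A, const std::vector<double>& X,
                  const std::vector<double>& B, std::vector<double>& Z,
                  std::size_t n) {
    for (double& z : Z) z = 0.0;                        // Z = 0
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k) {
                const double a_x = A[k*n + j] * X[j*n + i];
                for (std::size_t l = 0; l < n; ++l)     // Z[k;] += A[k;j] * X[j;i] * B[i;]
                    Z[k*n + l] += a_x * B[i*n + l];
            }
}
```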

University at Albany,SUNY lrm-23 lrm 11/15/2015 Fftpsirad2: Performance Comparisons

University at Albany, SUNY lrm-24 lrm 11/15/2015
Mechanizing MoA and Psi Reduction
–Index theory introduced: Abrams, 1972.
–MoA & ψ calculus theory: Mullin '88.
–Prototype compiler, output C, F90, HPF: Mullin and Thibault '94.
–HPF compiler, AST manipulations: Mullin et al. '96.
–SAC, functional C: Mullin and Bodo '96.
–C++ classes: Helal, Sameh, and Mullin '01.
–C++ expression templates: Mullin, Rutledge, Bond '02.
–PVL with the Portable Expression Template Engine (PETE).
–Parallel and distributed processing; abstract machine.
–Automate cost and determine optimizations: minimize the search space.
–Lifting compiler optimizations to the application programmer interface.
–Theory applied to embedded systems (C++, C, Fortran).

University at Albany, SUNY lrm-25 lrm 11/15/2015
On-going Research
We are implementing the psi calculus using expression templates. We are building on work done at MIT, and we are working with the MTL library developers (Lumsdaine) at Indiana University and the STL library developer, Musser, at RPI.

University at Albany, SUNY lrm-26 lrm 11/15/2015
Benefits of Using MoA and Psi Calculus
–The processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level.
–Composition of monolithic operations can be re-expressed as composition of operations on smaller data granularities:
 –Matches memory hierarchy levels.
 –Avoids materialization of intermediate arrays.
–The algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.
–Facilitates programming expressed at a high level:
 –Facilitates intentional program design and analysis.
 –Facilitates portability.
–This approach is applicable to many other problems in radar.

University at Albany, SUNY lrm-27 lrm 11/15/2015
Questions?
Lenore R. Mullin, Daniel J. Rosenkrantz, Harry B. Hunt III, Xingmin Luo
*The End*

University at Albany, SUNY lrm-28 lrm 11/15/2015
Typical C++ Operator Overloading
Example: A = B + C (vector add)
1. Pass B and C references to operator+.
2. Create a temporary result vector.
3. Calculate results, store them in the temporary.
4. Return a copy of the temporary.
5. Pass the result's reference to operator=.
6. Perform the assignment.
Two temporary vectors are created.
Additional memory use: static memory; dynamic memory (also affects execution time); cache misses/page faults.
Additional execution time: time to create a new vector; time to create a copy of a vector; time to destruct both temporaries.
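A hedged sketch of this pattern follows; the Vec type and names are illustrative only, not a particular library. Each use of operator+ materializes a full temporary, so a chain like A = B + C + D pays one temporary per '+'.

```cpp
#include <cstddef>
#include <vector>

// Naive overloading: operator+ builds and returns a temporary vector, which
// the assignment then copies into the destination.
struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n = 0) : data(n) {}
};

Vec operator+(const Vec& b, const Vec& c) {
    Vec temp(b.data.size());                    // 2. create temporary result vector
    for (std::size_t i = 0; i < b.data.size(); ++i)
        temp.data[i] = b.data[i] + c.data[i];   // 3. calculate results, store in temporary
    return temp;                                // 4. return the temporary (a copy on older compilers)
}
// 5-6. Vec's implicit operator= then copies the result into A.
```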

University at Albany, SUNY lrm-29 lrm 11/15/2015
C++ Expression Templates and PETE
–Compile-time expression tree manipulation: parse trees, not vectors, are created.
–PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory. PETE provides:
 –Expression template capability.
 –Facilities to help navigate and evaluate parse trees.
Example: A = B + C; the expression type (parse tree) is BinaryNode<OpAdd, Reference<Vector>, Reference<Vector>>.
1. Pass B and C references to operator+.
2. Create the expression parse tree.
3. Return the expression parse tree.
4. Pass the expression tree reference to operator=.
5. Calculate the result and perform the assignment.
Reduced memory use: the parse tree contains only references; better cache use.
Reduced execution time: loop-fusion-style optimization.
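The sketch below is a minimal, self-contained expression-template mechanism in the spirit of what this slide describes; it is not PETE's actual classes, and the Expr/Vec/AddNode names are assumptions. operator+ returns a lightweight parse-tree node holding references, and the element loop runs once, at assignment, so no temporary vectors are created.

```cpp
#include <cstddef>
#include <vector>

template <typename E>
struct Expr {                                   // CRTP base marking expression types
    const E& self() const { return static_cast<const E&>(*this); }
};

struct Vec : Expr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n = 0) : data(n) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <typename E>
    Vec& operator=(const Expr<E>& e) {          // evaluate the whole tree in one fused loop
        data.resize(e.self().size());
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e.self()[i];
        return *this;
    }
};

template <typename L, typename R>
struct AddNode : Expr<AddNode<L, R>> {          // analogous to a BinaryNode<OpAdd, L, R>
    const L& lhs; const R& rhs;
    AddNode(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

template <typename L, typename R>
AddNode<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return AddNode<L, R>(l.self(), r.self());
}

// Usage: Vec A, B(10), C(10), D(10);  A = B + C + D;
// builds AddNode<AddNode<Vec,Vec>,Vec> and evaluates it in a single pass.
```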

University at Albany, SUNY lrm-30 lrm 11/15/2015
Implementing Psi Calculus with Expression Templates
Example: A = take(4, drop(3, rev(B))), where B has 10 elements.
Recall: psi reduction for 1-d arrays always yields one or more expressions of the form
 x[i] = y[stride·i + offset],  l ≤ i < u
1. Form the expression tree: take(4, drop(3, rev(B))).
2. Add size information: take: size = 4; drop: size = 7; rev: size = 10; B: size = 10.
3. Apply psi reduction rules, node by node:
 take (size = 4): A[i] = B[-i + 6]
 drop (size = 7): A[i] = B[-(i+3) + 9] = B[-i + 6]
 rev (size = 10): A[i] = B[-i + B.size - 1] = B[-i + 9]
 B (size = 10):   A[i] = B[i]
4. Rewrite as sub-expressions with iterators at the leaves and loop-bounds information at the root: iterator offset = 6, stride = -1, size = 4.
Iterators are used for efficiency, rather than recalculating indices for each i. One "for" loop evaluates each sub-expression.
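A hedged C++ rendering of the final reduced loop follows (the names and fixed sizes are illustrative); the point is that the whole composition collapses to one strided loop with no reversed, dropped, or taken intermediate arrays.

```cpp
#include <array>
#include <cstddef>

// After psi reduction, A = take(4, drop(3, rev(B))) becomes
// A[i] = B[stride*i + offset] with stride = -1, offset = 6, 0 <= i < 4.
std::array<double, 4> take_drop_rev(const std::array<double, 10>& B) {
    const std::ptrdiff_t stride = -1, offset = 6;
    std::array<double, 4> A{};
    for (std::ptrdiff_t i = 0; i < 4; ++i)
        A[i] = B[stride * i + offset];      // reads B[6], B[5], B[4], B[3]
    return A;
}
```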