Efficient Radar Processing Via Array and Index Algebras
Lenore R. Mullin, Daniel J. Rosenkrantz, Harry B. Hunt III, and Xingmin Luo
University at Albany, SUNY
NSF CCR 0105536
11/15/2015
Outline
Overview
–Motivation
–Radar software processing: to exceed 1 × 10^12 ops/second
–The mapping problem: efficient use of the memory hierarchy; portable, scalable, …
–Radar uses linear and multi-linear operators: array-based operations
–Array operations require an array algebra and an index calculus
Array Algebra (MoA) and Index Calculus (Psi Calculus)
–Reshape to use the processor/memory hierarchy efficiently: lift dimension
–High-level monolithic operations: remove temporaries
Time Domain Convolution
Benefits of Using MoA and Psi Calculus
Levels of Processor/Memory Hierarchy
Levels of the processor/memory hierarchy can be modeled by increasing the dimensionality of the data array.
–An additional dimension for each level of the hierarchy.
–Envision the data as reshaped to reflect the increased dimensionality.
–The calculus automatically transforms the algorithm to reflect the reshaped data array.
–Data layout, data movement, and scalarization are automatically generated based on the reshaped data array.
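As a concrete illustration of the reshaping idea (a minimal pure-Python sketch, not the authors' code; `lift` and `index` are our own illustrative names), a flat vector can be viewed as a p × (n/p) array so that one axis ranges over processors and the other over each processor's local block:

```python
# "Lifting" the dimension of a flat array to model one level of the
# hierarchy: a length-12 vector becomes 3 "processor" rows of 4.

def lift(flat, p):
    """Reshape a flat list into p rows of len(flat)//p elements."""
    n = len(flat)
    assert n % p == 0, "length must divide evenly"
    block = n // p
    return [flat[i * block:(i + 1) * block] for i in range(p)]

def index(lifted, i, j):
    """psi-style indexing: element j of processor i's block."""
    return lifted[i][j]

x = list(range(12))
x2 = lift(x, 3)        # 3 "processors", blocks of 4
# x2[1] is processor 1's block: [4, 5, 6, 7]
```

Each further hierarchy level (cache, registers) would add one more axis in the same way, with loops over the new axis generated mechanically.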
Levels of Processor/Memory Hierarchy (continued)
Math and indexing operations in the same expression.
Framework for design-space search:
–Rigorous and provably correct.
–Extensible to complex architectures.
Approach: Mathematics of Arrays.
[Figure: "raising" array dimensionality for y = conv(h)(x); x is mapped across the memory hierarchy (main memory, L2 cache, L1 cache) and, for parallelism, across processors P0, P1, P2.]
Application Domain
Signal processing: 3-d radar data processing.
Composition of monolithic array operations.
The algorithm is input; architectural information (memory, processor hardware) is also input.
Change the algorithm to better match the hardware/memory/communication: lift the dimension algebraically.
Processing chain: Pulse Compression (convolution), Doppler Filtering, Beamforming (matrix multiply), Detection.
Model processors (dim = dim + 1); model time-variance (dim = dim + 1); model level-1 cache (dim = dim + 1); model all three: dim = dim + 3.
Current Abstraction Approaches
Even when operations compose mathematically, as in X(Y(Z)), implementations do not compose them without temporary arrays.
[Figure: taxonomy of current approaches, roughly:]
–Fine-tuned, high-performance libraries (scalable/portable): BLAS, LINPACK, LAPACK, ScaLAPACK, ATLAS; require highly skilled programmers.
–Libraries with partial algebras: PVL, Blitz++, MTL; compiled, with the PETE AST preprocessor.
–Some modern programming languages with monolithic arrays: Fortran 95, ZPL (compiled), MATLAB (interpreted), C++ with classes, functions, and templates.
–Classical compiler technology and optimization: loop transformations, grammar changes, compiler AST optimizations, standard compiler optimizations.
Outline
Overview
Array Algebra (MoA) and Index Calculus (Psi Calculus)
–Reshape to use the processor/memory hierarchy efficiently: lift dimension
–High-level monolithic operations: remove temporaries
Time Domain Convolution
Benefits of Using MoA and Psi Calculus
PSI Calculus
Basic properties:
–Index calculus: centers around the psi function.
–Shape-polymorphic functions and operators: operations are defined using shapes and psi.
–The fundamental type is the array, modeled as (shape_vector, components); scalars are 0-dimensional arrays, that is, (empty_vector, scalar value).
–Denotational Normal Form (DNF): reduced form in Cartesian coordinates, independent of data layout (row major, column major, regular sparse, …).
–Operational Normal Form (ONF): reduced form for 1-d memory layout(s).
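The (shape_vector, components) view and the psi function can be sketched in a few lines of Python (an illustration of the model, not the calculus' official API; `psi` here fixes row-major layout for concreteness):

```python
# An MoA-style array is a pair (shape, flat components); psi maps a
# full index vector to one component. Scalars are 0-d: empty shape,
# single component.

def psi(index, shape, data):
    """Row-major (Cartesian) indexing: psi(<i0, i1, ...>, shape) -> component."""
    assert len(index) == len(shape)
    offset = 0
    for i, s in zip(index, shape):
        assert 0 <= i < s
        offset = offset * s + i
    return data[offset]

shape = [2, 3]                     # a 2x3 array
data = [10, 11, 12, 13, 14, 15]    # row-major components
# a scalar is (empty_vector, value): psi([], [], [42]) -> 42
```

The DNF is independent of this layout choice; picking row-major (or column-major, or sparse) is exactly the step that turns a DNF into an ONF.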
Psi Reduction
The ONF has the minimum number of reads/writes. PSI Calculus rules are applied mechanically to produce the ONF, which is easily translated to an optimal loop implementation.
By psi reduction, A = cat(rev(B), rev(C)) becomes:
A[i] = B[B.size - 1 - i]            if 0 ≤ i < B.size
A[i] = C[C.size + B.size - 1 - i]   if B.size ≤ i < B.size + C.size
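A small sketch (our own illustrative code) contrasting the naive composition, which materializes reversed temporaries, with the psi-reduced loops, which do one read and one write per element:

```python
# A = cat(rev(B), rev(C)): naive composition vs. the psi-reduced form.

def naive_cat_rev(B, C):
    """Materializes two reversed temporaries, then concatenates."""
    return list(reversed(B)) + list(reversed(C))

def reduced_cat_rev(B, C):
    """ONF-style loops from the reduction above: no temporaries."""
    nB, nC = len(B), len(C)
    A = [0] * (nB + nC)
    for i in range(nB):
        A[i] = B[nB - 1 - i]
    for i in range(nB, nB + nC):
        A[i] = C[nC + nB - 1 - i]
    return A

B, C = [1, 2, 3], [4, 5]
# both give [3, 2, 1, 5, 4]; only the reduced form avoids temporaries
```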
Some Psi Calculus Operations
Convolution: PSI Calculus Description
PSI Calculus operators compose to form higher-level operations.
Definition of y = conv(h, x):
y[n] = Σ_{k=0}^{M-1} h[k] · x'[n + M - 1 - k],
where x has N elements, h has M elements, 0 ≤ n < N + M - 1, and x' is x padded by M - 1 zeros on either end.
Algorithm and PSI Calculus description:
–Form x':  x' = cat(reshape(<M-1>, <0>), cat(x, reshape(<M-1>, <0>)))
–Rotate x' (N + M - 1) times:  x'_rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x')
–Take the size-of-h part of x'_rot:  x'_final = binaryOmega(take, 0, …, 1, x'_rot)
–Multiply:  Prod = binaryOmega(*, 1, h, 1, x'_final)
–Sum:  Y = unaryOmega(sum, 1, Prod)
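The rotate/take/multiply/sum formulation can be checked against the direct definition with a short sketch (our own helper names; the reversal of h below matches the h[M-1-j] indexing that appears later in the ONF):

```python
# y = conv(h, x) two ways: via the slide's rotate/take/multiply/sum
# pipeline over a zero-padded x', and via the direct definition.

def rotate(v, n):
    """Left-rotate v by n positions."""
    n %= len(v)
    return v[n:] + v[:n]

def conv_psi(h, x):
    M, N = len(h), len(x)
    xp = [0] * (M - 1) + x + [0] * (M - 1)   # x padded on either end
    y = []
    for n in range(N + M - 1):
        window = rotate(xp, n)[:M]           # take the size-of-h part
        y.append(sum(h[M - 1 - j] * window[j] for j in range(M)))
    return y

def conv_direct(h, x):
    M, N = len(h), len(x)
    return [sum(h[k] * x[n - k] for k in range(M) if 0 <= n - k < N)
            for n in range(N + M - 1)]
```

The n-th rotation followed by take(M, ·) is just the window x'[n : n+M], so each output element is one dot product against (reversed) h.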
Experimental Platform and Method
Hardware: DY4 CHAMP-AV board
–Contains 4 MPC7400s and 1 MPC 8420.
–MPC7400 (G4): 450 MHz; 32 KB L1 data cache; 2 MB L2 cache; 64 MB memory/processor.
Software
–VxWorks 5.2 real-time OS.
–GCC 2.95.4 (non-official release): GCC 2.95.3 with patches for VxWorks.
–Optimization flags: -O3 -funroll-loops -fstrict-aliasing.
Method
–Run many iterations; report average, minimum, and maximum time: from 10,000,000 iterations for small data sizes down to 1,000 for large data sizes.
–All approaches run on the same data; only average times are shown here; only one G4 processor is used.
–Use of the VxWorks OS resulted in very low variability in timing, giving a high degree of confidence in the results.
Experiment: Conv(x, h)
The cost of temporaries in the regular C++ approach is more pronounced due to the large number of operations; the cost of expression-tree manipulation is also more pronounced.
Convolution and Dimension Lifting
Model the processor and level-1 cache:
–Start with 1-d inputs (the input dimension).
–Envision a 2nd dimension ranging over output values.
–Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.
–Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
–Psi-reduce to normal form.
Envision a 2nd dimension ranging over output values
Let tz = N + M - 1, with M = size(h) = 4 and N = size(x).
[Figure: a tz × M view in which row n holds the window of x' multiplied elementwise by (h3, h2, h1, h0); row 0 is (0, 0, 0, x0).]
Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned
Let p = 2.
[Figure: the tz × 4 array reshaped to 2 × (tz/2) × 4.]
Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned
[Figure: each tz/2 × 4 block reshaped to tz/2 × 2 × 2, giving a 2 × (tz/2) × 2 × 2 array.]
ONF for the Convolution Decomposition with Processors & Cache (Time Domain)
Let tz = N + M - 1, with h of size M and x of size N. Generic 4-dimensional form after psi reduction:
1. for i0 = 0 to p - 1 do:                      (processor loop)
2.   for i1 = 0 to tz/p - 1 do:                 (time loop)
3.     sum ← 0
4.     for icache_row = 0 to M/cache - 1 do:    (cache loop)
5.       for i3 = 0 to cache - 1 do:
6.         sum ← sum + h[(M - (icache_row · cache + i3)) - 1] · x'[((tz/p) · i0 + i1) + icache_row · cache + i3]
sum is calculated for each element of y.
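The 4-dimensional loop nest above can be transliterated and checked against a direct convolution (a sketch in plain Python; we assume, as the decomposition does, that p divides tz and cache divides M):

```python
# The slide's ONF loop nest: processor loop, time loop, cache loops.
# Sequential here; the i0 loop is the one a parallel system would
# distribute across processors.

def conv_onf(h, x, p, cache):
    M, N = len(h), len(x)
    tz = N + M - 1
    assert tz % p == 0 and M % cache == 0
    xp = [0] * (M - 1) + x + [0] * (M - 1)   # x'
    y = [0] * tz
    for i0 in range(p):                      # processor loop
        for i1 in range(tz // p):            # time loop
            s = 0
            for ic in range(M // cache):     # cache loop
                for i3 in range(cache):
                    j = ic * cache + i3
                    n = (tz // p) * i0 + i1
                    s += h[(M - j) - 1] * xp[n + j]
            y[(tz // p) * i0 + i1] = s
    return y

def conv_direct(h, x):
    M, N = len(h), len(x)
    return [sum(h[k] * x[n - k] for k in range(M) if 0 <= n - k < N)
            for n in range(N + M - 1)]
```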
Outline
Overview
Array Algebra (MoA) and Index Calculus (Psi Calculus)
Time Domain Convolution
Other algorithms in Radar
–Modified Gram-Schmidt QR decomposition: MoA to ONF, experiments
–Composition of matrix multiplication in beamforming: MoA to DNF, experiments
–FFT
Benefits of Using MoA and Psi Calculus
Algorithms in Radar
–Time-domain convolution: conv(x, h)
–Modified Gram-Schmidt QR(A)
–Beamforming: A × (B^H × C)
[Diagram: for each algorithm, a manual description and derivation for one processor yields the DNF; lifting the dimension (processor, L1 cache) and reformulating yields the ONF; DNF to ONF is mechanized using expression templates; DNF/ONF are implemented (Fortran 90), benchmarked at NCSA with LAPACK under compiler optimizations, used to reason about RAW, and lead to thoughts on an abstract machine for MoA and the calculus.]
ONF for the QR Decomposition with Processors & Cache (Modified Gram-Schmidt)
[Diagram: a main loop containing initialization, compute norm, normalize, dot product, and orthogonalize steps, each decomposed into processor and cache loops.]
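For reference, the algorithm the slide's ONF decomposes is classical Modified Gram-Schmidt; a minimal, unlifted pure-Python sketch (no processor/cache dimensions modeled, column-major lists, our own naming) with the loop bodies labeled after the slide's steps:

```python
import math

# Modified Gram-Schmidt QR on a list of columns: A = Q R with Q's
# columns orthonormal and R upper triangular.

def mgs_qr(cols):
    """cols: list of columns (each a list of floats). Returns (Q_cols, R)."""
    n = len(cols)
    q = [c[:] for c in cols]
    r = [[0.0] * n for _ in range(n)]
    for k in range(n):                                        # main loop
        r[k][k] = math.sqrt(sum(v * v for v in q[k]))         # compute norm
        q[k] = [v / r[k][k] for v in q[k]]                    # normalize
        for j in range(k + 1, n):
            r[k][j] = sum(a * b for a, b in zip(q[k], q[j]))  # dot product
            q[j] = [b - r[k][j] * a
                    for a, b in zip(q[k], q[j])]              # orthogonalize
    return q, r
```

The norm, dot-product, and orthogonalize steps are the reductions and vector updates that the ONF splits across processor and cache loops.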
DNF for the Composition A × (B^H × C) (Beamforming)
Given n × n arrays A, B, X, Z. Generic 4-dimensional form:
1. Z = 0
2. for i = 0 to n - 1 do:
3.   for j = 0 to n - 1 do:
4.     for k = 0 to n - 1 do:
5.       Z[k;] ← Z[k;] + A[k;j] · X[j;i] · B[i;]
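Following the loop body as written (which uses A, X, B), the fused nest computes the same result as the two-step product A · X · B while materializing no intermediate n × n array, only rank-1 row updates. A sketch checking this (our own code):

```python
# z[k;] += A[k;j] * X[j;i] * B[i;], summed over i and j, gives
# Z[k][l] = sum_{i,j} A[k][j] X[j][i] B[i][l] = (A . X . B)[k][l].

def compose_fused(A, X, B):
    n = len(A)
    Z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                coef = A[k][j] * X[j][i]
                for l in range(n):           # the row update z[k;] += ...
                    Z[k][l] += coef * B[i][l]
    return Z

def matmul(P, Q):
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
X = [[5, 6], [7, 8]]
B = [[1, 0], [0, 1]]
Z = compose_fused(A, X, B)
```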
Fftpsirad2: Performance Comparisons
Mechanizing MoA and Psi Reduction
–Index theory introduced: Abrams, 1972.
–MoA and calculus theory: Mullin '88.
–Prototype compiler, outputting C, F90, and HPF: Mullin and Thibault '94.
–HPF compiler, AST manipulations: Mullin et al. '96.
–SAC (functional C): Mullin and Bodo '96.
–C++ classes: Helal, Sameh, and Mullin '01.
–C++ expression templates: Mullin, Ruttledge, and Bond '02.
–PVL with the Portable Expression Template Engine (PETE).
–Parallel and distributed processing; an abstract machine.
–Automate cost and determine optimizations: minimize the search space.
–Lifting compiler optimizations to the application programmer interface.
–Theory applied to embedded systems.
On-going Research
We are implementing the Psi Calculus using expression templates, building on work done at MIT. We are working with the MTL library developers (Lumsdaine) at Indiana University and with the STL library developer (Musser) at RPI.
Benefits of Using MoA and Psi Calculus
–The processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level.
–A composition of monolithic operations can be re-expressed as a composition of operations on smaller data granularities, which matches the memory-hierarchy levels and avoids materialization of intermediate arrays.
–The algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.
–Facilitates programming expressed at a high level: intentional program design and analysis, and portability.
–This approach is applicable to many other problems in radar.
Email and Questions
Lenore R. Mullin, lenore@cs.albany.edu
Daniel J. Rosenkrantz, djr@cs.albany.edu
Harry B. Hunt III, hunt@cs.albany.edu
Xingmin Luo, xluo@cs.albany.edu
*The End*
Typical C++ Operator Overloading
Example: A = B + C (vector add)
1. Pass B and C references to operator+.
2. Create a temporary result vector.
3. Calculate results; store in the temporary.
4. Return a copy of the temporary.
5. Pass the result's reference to operator=.
6. Perform the assignment.
Two temporary vectors are created.
Additional memory use: static memory; dynamic memory (also affects execution time); cache misses/page faults.
Additional execution time: time to create a new vector; time to create a copy of a vector; time to destruct both temporaries.
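The cost mechanism is not C++-specific; any eagerly overloaded operator allocates one whole temporary per operation. A Python analogue of the slide (illustrative code; `TEMPS` is our own counter, not part of any library):

```python
# Eager operator overloading: each '+' builds and returns a complete
# temporary vector, so a + b + c allocates two intermediates and makes
# two full passes over the data.

TEMPS = 0

class Vec:
    def __init__(self, data):
        self.data = list(data)
    def __add__(self, other):
        global TEMPS
        TEMPS += 1                 # one whole temporary vector per '+'
        return Vec(x + y for x, y in zip(self.data, other.data))

a, b, c = Vec([1, 2]), Vec([3, 4]), Vec([5, 6])
d = a + b + c                      # two temporaries created
```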
C++ Expression Templates and PETE
Parse trees, not vectors, are created. For A = B + C:
1. Pass B and C references to operator+.
2. Create the expression parse tree (holding references B& and C&).
3. Return the expression parse tree.
4. Pass the expression-tree reference to operator=.
5. Calculate the result and perform the assignment.
Reduced memory use: the parse tree contains only references.
Reduced execution time: better cache use; loop-fusion-style optimization; compile-time expression-tree manipulation.
For A = B + C, the expression type is a template along the lines of BinaryNode<OpAdd, Reference<Vector>, Reference<Vector>>.
PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory: http://www.acl.lanl.gov/pete. PETE provides expression-template capability and facilities to help navigate and evaluate parse trees.
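The idea PETE implements at C++ compile time can be sketched at Python run time (a deliberately simplified analogue, not PETE's API): `+` builds a lightweight parse node holding references, and only the final assignment walks the tree, fusing all the loops into one.

```python
# Deferred evaluation: B + C + D builds a tree of Add nodes over Leaf
# references; evaluate() makes a single fused pass, creating no
# intermediate vectors.

class Node:
    def __add__(self, other):
        return Add(self, other)

class Leaf(Node):
    def __init__(self, data):
        self.data = data
    def __getitem__(self, i):
        return self.data[i]
    def __len__(self):
        return len(self.data)

class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right   # references only, no copies
    def __getitem__(self, i):
        return self.left[i] + self.right[i]
    def __len__(self):
        return len(self.left)

def evaluate(expr):
    return [expr[i] for i in range(len(expr))]   # one fused loop

B, C, D = Leaf([1, 2]), Leaf([3, 4]), Leaf([5, 6])
A = evaluate(B + C + D)    # parse tree built first, then one pass
```

In C++, the tree type (e.g. BinaryNode<OpAdd, …>) is resolved at compile time, so the fused loop carries no run-time dispatch overhead.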
Implementing Psi Calculus with Expression Templates
Example: A = take(4, drop(3, rev(B))), with B of size 10.
Recall: psi reduction for 1-d arrays always yields one or more expressions of the form
x[i] = y[stride · i + offset],  l ≤ i < u.
Steps:
1. Form the expression tree: take(4, drop(3, rev(B))).
2. Add size information: B has size 10; rev(B), size 10; drop(3, ·), size 7; take(4, ·), size 4.
3. Apply psi reduction rules:
   size = 10:  A[i] = B[i]
   size = 10:  A[i] = B[-i + B.size - 1] = B[-i + 9]
   size = 7:   A[i] = B[-(i + 3) + 9] = B[-i + 6]
   size = 4:   A[i] = B[-i + 6]
4. Rewrite as sub-expressions with iterators at the leaves and loop-bounds information at the root: iterator offset = 6, stride = -1, size = 4.
Iterators are used for efficiency, rather than recalculating indices for each i; one "for" loop evaluates each sub-expression.
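The reduction can be verified directly (a small pure-Python check of the example, with `take`, `drop`, and `rev` written out naively):

```python
# take(4, drop(3, rev(B))) for a size-10 B collapses to the single
# strided access A[i] = B[-i + 6]: stride -1, offset 6, size 4.

def rev(v):
    return v[::-1]

def drop(n, v):
    return v[n:]

def take(n, v):
    return v[:n]

def reduced(B):
    stride, offset, size = -1, 6, 4
    return [B[stride * i + offset] for i in range(size)]

B = list(range(10))
# the three-operation composition and the one-loop reduced form agree
assert take(4, drop(3, rev(B))) == reduced(B) == [6, 5, 4, 3]
```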