Levels of Processor/Memory Hierarchy Can Be Modeled by Increasing Dimensionality of Data Array
Lenore R. Mullin, University at Albany, SUNY

Presentation transcript:

Slide lrm-1: Levels of Processor/Memory Hierarchy Can Be Modeled by Increasing Dimensionality of Data Array
- An additional dimension for each level of the hierarchy.
- Envision data as reshaped to reflect the increased dimensionality.
- The calculus automatically transforms the algorithm to reflect the reshaped data array.
- Data layout, data movement, and scalarization are automatically generated based on the reshaped data array.

Slide lrm-2: Levels of Processor/Memory Hierarchy (continued)
- Math and indexing operations in the same expression.
- A framework for design-space search:
  - Rigorous and provably correct.
  - Extensible to complex architectures.
Approach: Mathematics of Arrays. Example: "raising" array dimensionality, y = conv(h, x).
[Figure: x mapped across the memory hierarchy (main memory, L2 cache, L1 cache) and, for parallelism, across processors P0, P1, P2.]

Slide lrm-3: Application Domain
Signal processing: 3-d radar data processing (pulse compression, Doppler filtering, beamforming, detection), built as a composition of monolithic array operations (convolution, matrix multiply).
- The algorithm is input.
- Architectural information is input (hardware info: memory, processor).
- Change the algorithm to better match hardware/memory/communication by lifting the dimension algebraically:
  - Model processors (dim = dim+1); model time-variance (dim = dim+1); model Level 1 cache (dim = dim+1); model all three: dim = dim+3.

Slide lrm-4: Some Psi Calculus Operations
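The slide's table of operations was an image and did not survive transcription. As a hedged stand-in, here is a minimal C++ sketch of a few 1-d Psi Calculus operations used later in the deck (iota, take, drop, rotate, cat); the signatures are illustrative only, not Mullin's actual definitions, which also cover multidimensional arrays.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// iota(n): the vector <0 1 ... n-1>
Vec iota(std::size_t n) {
    Vec v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<double>(i);
    return v;
}

// take(k, x): the first k elements of x
Vec take(std::size_t k, const Vec& x) { return Vec(x.begin(), x.begin() + k); }

// drop(k, x): x without its first k elements
Vec drop(std::size_t k, const Vec& x) { return Vec(x.begin() + k, x.end()); }

// rotate(k, x): x rotated left by k positions
Vec rotate(std::size_t k, const Vec& x) {
    Vec y(x);
    std::rotate(y.begin(), y.begin() + (k % y.size()), y.end());
    return y;
}

// cat(x, y): concatenation of x and y
Vec cat(const Vec& x, const Vec& y) {
    Vec z(x);
    z.insert(z.end(), y.begin(), y.end());
    return z;
}
```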

Slide lrm-5: Convolution: Psi Calculus Description
Psi Calculus operators compose to form higher-level operations.

Definition of y = conv(h, x):
    y[n] = sum_{k=0}^{M-1} h[k] * x'[n+M-1-k]
where x has N elements, h has M elements, 0 ≤ n < N+M-1, and x' is x padded by M-1 zeros on either end.

Algorithm step / Psi Calculus description:
- Initial step: x and h are given.
- Form x': x' = cat(reshape(<M-1>, <0>), cat(x, reshape(<M-1>, <0>)))
- Rotate x' (N+M-1) times: x'_rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x')
- Take the size-of-h part of each row of x'_rot: x'_final = binaryOmega(take, 0, reshape(<N+M-1>, <M>), 1, x'_rot)
- Multiply: Prod = binaryOmega(*, 1, h, 1, x'_final)
- Sum: Y = unaryOmega(sum, 1, Prod)
[The slide's worked numeric example (the values of x, h, x', x'_rot, x'_final, Prod, and Y) did not survive transcription.]
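As a hedged illustration of how these operators compose, the following sketch transcribes the table directly in C++, reusing the 1-d helpers from the previous slide's sketch. It materializes every intermediate (x', the rotated rows, the taken prefixes), which is exactly what Psi reduction later eliminates; conv here is an assumed free function, not code from the deck.

```cpp
// y = conv(h, x) built by composing cat / rotate / take, plus a
// pointwise multiply and a row sum, mirroring the table row by row.
// Reuses Vec, cat, rotate, take from the sketch on the previous slide.
Vec conv(const Vec& h, const Vec& x) {
    const std::size_t M = h.size(), N = x.size(), tz = N + M - 1;

    // Form x': x padded by M-1 zeros on either end.
    Vec zeros(M - 1, 0.0);
    Vec xp = cat(zeros, cat(x, zeros));

    Vec y(tz);
    for (std::size_t n = 0; n < tz; ++n) {
        // Row n of x'_rot is x' rotated left n times; taking its
        // first M elements gives row n of x'_final.
        Vec rowM = take(M, rotate(n, xp));
        // Prod and Y: multiply against h (reversed, per the definition
        // y[n] = sum_k h[k] * x'[n+M-1-k]) and sum the row.
        double sum = 0.0;
        for (std::size_t k = 0; k < M; ++k) sum += h[M - 1 - k] * rowM[k];
        y[n] = sum;
    }
    return y;
}
```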

Slide lrm-6: Experimental Platform and Method
Hardware: DY4 CHAMP-AV board, containing 4 MPC7400s and 1 MPC8420.
- MPC7400 (G4): 450 MHz, 32 KB L1 data cache, 2 MB L2 cache, 64 MB memory per processor.
Software:
- VxWorks 5.2 (real-time OS).
- GCC (non-official release: GCC with patches for VxWorks). Optimization flags: -O3 -funroll-loops -fstrict-aliasing.
Method:
- Run many iterations; report average, minimum, and maximum times (from 10,000,000 iterations for small data sizes down to 1,000 for large data sizes).
- All approaches run on the same data; only average times are shown here; only one G4 processor was used.
- Use of the VxWorks OS resulted in very low variability in timing, giving a high degree of confidence in the results.

Slide lrm-7: Convolution and Dimension Lifting
Model the processor and Level 1 cache:
- Start with 1-d inputs (the input dimension).
- Envision a 2nd dimension ranging over output values.
- Envision processors: reshape into a 3rd dimension; the 2nd dimension is partitioned.
- Envision cache: reshape into a 4th dimension; the 1st dimension is partitioned.
- "Psi" reduce to normal form.

Slide lrm-8: Envision a 2nd dimension ranging over output values.
Let tz = N+M-1, where M (= 4 here) is the number of elements of h and N is the number of elements of x.
[Figure: a tz-by-4 array in which row n holds (h3, h2, h1, h0) aligned against the padded x' starting at position n; row 0, for example, sits over (0, 0, 0, x0).]

Slide lrm-9: Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.
Let p = the number of processors.
[Figure: the tz-by-M array split along the 2nd dimension into p planes, each (tz/p)-by-M.]

Slide lrm-10: Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
[Figure: with p = 2 and a cache block of 2, each (tz/2)-by-4 plane is reshaped to (tz/2)-by-2-by-2, giving a 2 x tz/2 x 2 x 2 array.]

Slide lrm-11: ONF for the Convolution Decomposition with Processors & Cache
Generic form, 4-dimensional, after Psi reduction (time domain). Let tz = N+M-1, with M the number of elements of h and N the number of elements of x.
1. For i0 = 0 to p-1 do:                        (processor loop)
2.   For i1 = 0 to tz/p - 1 do:                 (time loop)
3.     sum <- 0
4.     For icache_row = 0 to M/cache - 1 do:    (cache loop)
5.       For i3 = 0 to cache-1 do:
6.         sum <- sum + h[(M - (icache_row*cache + i3)) - 1] * x'[(tz/p)*i0 + i1 + icache_row*cache + i3]
sum is calculated for each element of y.
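A hedged C++ rendering of this ONF for a single address space (the processor loop runs sequentially here as a stand-in for the parallel dimension); the store of sum into y is implicit on the slide and made explicit below, and the sketch assumes p divides tz and cache divides M. conv_onf is an assumed name, not code from the deck.

```cpp
#include <cstddef>

// ONF convolution loop nest: i0 ranges over processors, i1 over the
// output slots owned by each processor, and the two inner loops tile
// the M-point dot product into cache-sized blocks.
// xp is x padded by M-1 zeros on either end; y has tz = N+M-1 slots.
void conv_onf(const double* h, std::size_t M,
              const double* xp,
              double* y, std::size_t tz,
              std::size_t p, std::size_t cache) {
    for (std::size_t i0 = 0; i0 < p; ++i0)                   // processor loop
        for (std::size_t i1 = 0; i1 < tz / p; ++i1) {        // time loop
            double sum = 0.0;
            for (std::size_t ic = 0; ic < M / cache; ++ic)   // cache loop
                for (std::size_t i3 = 0; i3 < cache; ++i3)
                    sum += h[(M - (ic * cache + i3)) - 1]
                         * xp[(tz / p) * i0 + i1 + ic * cache + i3];
            y[(tz / p) * i0 + i1] = sum;  // one element of y per (i0, i1)
        }
}
```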

Slide lrm-12: Outline
- Overview
- Array algebra (MoA) and index calculus (Psi Calculus)
- Time domain convolution
- Other algorithms in radar:
  - Modified Gram-Schmidt QR decomposition: MoA to ONF, experiments
  - Composition of matrix multiplication in beamforming: MoA to DNF, experiments
  - FFT
- Benefits of using MoA and Psi Calculus

Slide lrm-13: MoA & Psi Calculus Roadmap
Algorithms in radar: time domain convolution (x, y); Modified Gram-Schmidt QR (A); A x (B^H x C) beamforming.
[Diagram: a manual description and derivation for one processor yields the DNF (the ONF for 1 processor); lifting the dimension for the processor and L1 cache and reformulating yields the ONF; the DNF-to-ONF step is mechanized using expression templates and compiler optimizations; the DNF/ONF is implemented in Fortran 90, used to reason about RAW, and benchmarked at NCSA with LAPACK; thoughts on an abstract machine.]

Slide lrm-14: Benefits of Using MoA and Psi Calculus
- The processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level.
- Compositions of monolithic operations can be re-expressed as compositions of operations on smaller data granularities:
  - Matches memory hierarchy levels.
  - Avoids materialization of intermediate arrays.
- The algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.
- Facilitates programming expressed at a high level:
  - Facilitates intentional program design and analysis.
  - Facilitates portability.
- This approach is applicable to many other problems in radar.

Slide lrm-15: ONF for the QR Decomposition with Processors & Cache (Modified Gram-Schmidt)
[Diagram: a main loop containing initialization, compute-norm, normalize, dot-product, and orthogonalize phases; the compute-norm and orthogonalize phases each carry processor and cache loops.]
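The diagram labels are all that survived of this slide, so as a hedged point of reference, here is a minimal single-processor Modified Gram-Schmidt sketch (no dimension lifting or cache blocking) with the slide's phases marked in comments; mgs_qr and its column-major layout are assumptions of this sketch, not the deck's code.

```cpp
#include <cmath>
#include <cstddef>

// Modified Gram-Schmidt QR on a column-major n x n matrix A, which is
// overwritten with Q; R is n x n, row-major, assumed zero-initialized.
void mgs_qr(double* A, double* R, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {              // main loop
        double norm = 0.0;                             // compute norm
        for (std::size_t i = 0; i < n; ++i) norm += A[j*n + i] * A[j*n + i];
        norm = std::sqrt(norm);
        R[j*n + j] = norm;
        for (std::size_t i = 0; i < n; ++i)            // normalize
            A[j*n + i] /= norm;
        for (std::size_t k = j + 1; k < n; ++k) {
            double dot = 0.0;                          // dot product
            for (std::size_t i = 0; i < n; ++i) dot += A[j*n + i] * A[k*n + i];
            R[j*n + k] = dot;
            for (std::size_t i = 0; i < n; ++i)        // orthogonalize
                A[k*n + i] -= dot * A[j*n + i];
        }
    }
}
```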

Slide lrm-16: DNF for the Composition of A x (B^H x C) (Beamforming)
Generic form, 4-dimensional. Given n-by-n arrays A, B, X, Z:
1. Z = 0
2. For i = 0 to n-1 do:
3.   For j = 0 to n-1 do:
4.     For k = 0 to n-1 do:
5.       z[k;] <- z[k;] + A[k;j] * X[j;i] * B[i;]
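A hedged C++ transcription of the DNF above for row-major n-by-n arrays; the slide's whole-row update z[k;] becomes the innermost m loop, so the intermediate product is never materialized. dnf_axb is an assumed name, not code from the deck.

```cpp
#include <cstddef>

// DNF loop nest: Z accumulates A[k;j] * X[j;i] times row B[i;],
// computing the composed product without an intermediate n x n array.
void dnf_axb(const double* A, const double* X, const double* B,
             double* Z, std::size_t n) {
    for (std::size_t m = 0; m < n * n; ++m) Z[m] = 0.0;   // Z = 0
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k) {
                const double a = A[k*n + j] * X[j*n + i]; // scalar coefficient
                for (std::size_t m = 0; m < n; ++m)       // row update z[k;]
                    Z[k*n + m] += a * B[i*n + m];
            }
}
```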

Slide lrm-17: Typical C++ Operator Overloading
Example: A = B + C (vector add).
1. Pass B and C references to operator+.
2. Create a temporary result vector.
3. Calculate the results and store them in the temporary.
4. Return a copy of the temporary.
5. Pass the result reference to operator=.
6. Perform the assignment.
Two temporary vectors are created, causing:
- Additional memory use: static memory and dynamic memory (which also affects execution time through cache misses/page faults).
- Additional execution time: time to create a new vector, time to create a copy of a vector, and time to destruct both temporaries.
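A minimal sketch of the style the slide critiques, under pre-C++11 semantics (move constructors later blunted some of these costs); MyVec is a hypothetical vector class invented for illustration.

```cpp
#include <cstddef>
#include <vector>

class MyVec {                                     // hypothetical vector class
public:
    explicit MyVec(std::size_t n) : data_(n) {}
    double& operator[](std::size_t i)       { return data_[i]; }
    double  operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }
private:
    std::vector<double> data_;
};

// Steps 1-4: operator+ builds a full temporary and returns a copy of it.
MyVec operator+(const MyVec& b, const MyVec& c) {
    MyVec temp(b.size());                         // step 2: temporary vector
    for (std::size_t i = 0; i < b.size(); ++i)
        temp[i] = b[i] + c[i];                    // step 3: compute into it
    return temp;                                  // step 4: copy (pre-C++11)
}

// A = B + C then runs steps 5-6 in the (implicit) assignment operator,
// after both temporaries have already been paid for.
```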

Slide lrm-18: C++ Expression Templates and PETE
With expression templates, parse trees, not vectors, are created: A = B + C produces an expression type such as Expression<BinaryNode<OpAdd, Reference<Vector>, Reference<Vector>>>.
1. Pass B and C references to operator+.
2. Create an expression parse tree.
3. Return the expression parse tree.
4. Pass the expression-tree reference to operator=.
5. Calculate the result and perform the assignment.
Benefits:
- Reduced memory use: the parse tree contains only references.
- Reduced execution time: better cache use, loop-fusion-style optimization, and compile-time expression-tree manipulation.
PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory. PETE provides expression-template capability and facilities to help navigate and evaluate parse trees.
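PETE itself is far more general; the following hedged miniature shows only the mechanism the slide describes: operator+ returns a lightweight parse-tree node holding references, and the single loop in operator= fuses evaluation with assignment. All names here (ETVec, AddNode) are inventions of this sketch, not PETE's API.

```cpp
#include <cstddef>
#include <vector>

// Parse-tree node: holds references only, evaluates lazily per element.
template <class L, class R>
struct AddNode {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct ETVec {
    std::vector<double> data;
    explicit ETVec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    // operator=: the only loop - walks the tree once per element,
    // fusing evaluation with assignment (no temporaries).
    template <class Expr>
    ETVec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ builds tree nodes; no arithmetic happens here.
inline AddNode<ETVec, ETVec> operator+(const ETVec& l, const ETVec& r) {
    return {l, r};
}
template <class L, class R>
AddNode<AddNode<L, R>, ETVec> operator+(const AddNode<L, R>& l, const ETVec& r) {
    return {l, r};
}

// Usage: A = B + C + D compiles to a single fused loop over the tree
// AddNode<AddNode<ETVec, ETVec>, ETVec>, with no intermediate vectors.
```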

Slide lrm-19: Implementing Psi Calculus with Expression Templates
Example: A = take(4, drop(3, rev(B))).
Recall: Psi reduction for 1-d arrays always yields one or more expressions of the form x[i] = y[stride*i + offset], l ≤ i < u.
1. Form the expression tree: take(4, drop(3, rev(B))).
2. Add size information: B has size 10; rev(B) has size 10; drop(3, ...) has size 7; take(4, ...) has size 4.
3. Apply Psi reduction rules:
   - size = 10: A[i] = B[i]
   - size = 10: A[i] = B[-i + B.size - 1] = B[-i + 9]
   - size = 7:  A[i] = B[-(i+3) + 9] = B[-i + 6]
   - size = 4:  A[i] = B[-i + 6]
4. Rewrite as sub-expressions with iterators at the leaves and loop-bound information at the root: an iterator with offset = 6, stride = -1, size = 4.
Iterators are used for efficiency, rather than recalculating indices for each i; one "for" loop evaluates each sub-expression.
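A hedged sketch of the end product of the reduction for this example: the composed take/drop/rev collapses to the single indexing rule A[i] = B[-i + 6] evaluated in one loop, with no intermediate arrays. The slide's actual values of B were lost in transcription, so the data below is made up.

```cpp
#include <cstdio>

int main() {
    // Hypothetical 10-element B (the slide's values did not survive).
    const int B[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int A[4];

    // A = take(4, drop(3, rev(B))) reduced to A[i] = B[stride*i + offset]
    // with stride = -1, offset = 6, 0 <= i < 4: one loop, no temporaries.
    for (int i = 0; i < 4; ++i)
        A[i] = B[-1 * i + 6];

    for (int i = 0; i < 4; ++i)
        std::printf("%d ", A[i]);   // prints: 6 5 4 3
    std::printf("\n");
    return 0;
}
```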