AltiVec Extensions to the Portable Expression Template Engine (PETE)*

Edward Rutledge
MIT Lincoln Laboratory
HPEC 2002, 25 September 2002, Lexington, MA

* This work is sponsored by the US Navy under an Air Force contract. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

Outline

– Overview
  – Motivation for using C++
  – Expression Templates and PETE
  – AltiVec
– Combining PETE and AltiVec
– Experiments
– Future Work and Conclusions

Programming Modern Signal Processors

Software technologies:

  Hand coded loop:   for (i = 0; i < ROWS; i++)
                       for (j = 0; j < COLS; j++)
                         a[i][j] = b[i][j] + c[i][j];
  C (e.g. VSIPL):    vsip_madd_f(b, c, a);
  C++ (with PETE):   a = b + c;

Software goals:

  Portability:   Low (hand coded loop) / High (VSIPL) / High (PETE)
  Productivity:  Low / Medium / High
  Performance:   High / Medium / High! (with PETE)

Challenge: translate high-level statements into architecture-specific implementations (e.g. use the AltiVec C language extensions).

Typical C++ Operator Overloading

Example: A = B + C (vector add)

1. Pass B and C references to operator+
2. Create temporary result vector
3. Calculate results, store in temporary
4. Return copy of temporary
5. Pass result reference to operator=
6. Perform assignment

Two temporary vectors are created.

Additional memory use: static memory; dynamic memory (also affects execution time); cache misses/page faults.
Additional execution time: time to create a new vector; time to create a copy of a vector; time to destruct both temporaries.
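The pattern above can be made concrete with a short sketch (not from the original presentation; the Vector class and all names here are hypothetical, and the cost commentary assumes a 2002-era compiler without move semantics):

  #include <cstddef>

  class Vector {
  public:
      explicit Vector(std::size_t n) : size_(n), data_(new float[n]) {}   // dynamic allocation
      Vector(const Vector& o) : size_(o.size_), data_(new float[o.size_]) {
          for (std::size_t i = 0; i < size_; ++i) data_[i] = o.data_[i];  // deep copy
      }
      ~Vector() { delete[] data_; }
      Vector& operator=(const Vector& o) {          // step 6: elementwise assignment
          for (std::size_t i = 0; i < size_; ++i) data_[i] = o.data_[i];
          return *this;
      }
      float& operator[](std::size_t i) { return data_[i]; }
      float operator[](std::size_t i) const { return data_[i]; }
      std::size_t size() const { return size_; }
  private:
      std::size_t size_;
      float* data_;
  };

  // Steps 1-4: allocate a temporary, fill it, return it by value
  // (which copies the temporary on compilers without return-value optimization).
  Vector operator+(const Vector& b, const Vector& c) {
      Vector temp(b.size());                        // temporary result vector
      for (std::size_t i = 0; i < b.size(); ++i)
          temp[i] = b[i] + c[i];
      return temp;                                  // copy of temporary returned
  }

Every A = B + C therefore pays for two extra vectors' worth of allocation, copying, and destruction, exactly the overhead itemized above.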

C++ Expression Templates and PETE

Example: A = B + C

Expression type (the parse tree, encoded at compile time):

  BinaryNode<OpAdd, Reference<Vector>, Reference<Vector> >

1. Pass B and C references to operator+
2. Create expression parse tree
3. Return expression parse tree
4. Pass expression tree reference to operator=
5. Calculate result and perform assignment

Parse trees, not vectors, are created.

Reduced memory use: the parse tree only contains references.
Reduced execution time: better cache use; loop-fusion-style optimization; compile-time expression tree manipulation.

PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory. PETE provides expression template capability and facilities to help navigate and evaluate parse trees.
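A minimal, self-contained sketch of the expression-template idea follows (hypothetical names, much simplified relative to PETE's actual classes): operator+ returns a lightweight node type instead of a vector, and assignment walks the tree in a single fused loop.

  #include <cstddef>
  #include <vector>

  struct OpAdd { static float apply(float a, float b) { return a + b; } };

  // Internal node: stores references to its operands, no element data.
  template <typename Op, typename L, typename R>
  struct BinaryNode {
      const L& left;
      const R& right;
      float operator[](std::size_t i) const { return Op::apply(left[i], right[i]); }
  };

  struct Vec {
      std::vector<float> data;
      float operator[](std::size_t i) const { return data[i]; }

      // Assignment walks the expression once: one fused loop, no temporaries.
      template <typename Expr>
      Vec& operator=(const Expr& e) {
          for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
          return *this;
      }
  };

  // Building the tree: no arithmetic happens here, only a type is created.
  // (PETE templates this over arbitrary subtree types so expressions nest.)
  inline BinaryNode<OpAdd, Vec, Vec> operator+(const Vec& b, const Vec& c) {
      return BinaryNode<OpAdd, Vec, Vec>{b, c};
  }

With this, b + c has type BinaryNode<OpAdd, Vec, Vec>, which is the parse tree, and a = b + c compiles down to a single loop computing a[i] = b[i] + c[i].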

AltiVec Overview

AltiVec architecture:
– SIMD extensions to the PowerPC (PPC) architecture
– Uses 128-bit "vectors": one vector == 4 32-bit floats
– API allows programmers to directly insert AltiVec code into programs
– Theoretical max FLOP rate: 8 FLOPs/clock cycle (one 4-wide fused multiply-add per cycle)

AltiVec C/C++ language extensions:
– New "vector" keyword for new types
– New operators for use on vector types
– Vector types must be 16-byte aligned
– Can cast from native C types to vector types and vice versa

Example: a = b + c

  int i;
  vector float *avec, *bvec, *cvec;
  avec = (vector float*)a;
  bvec = (vector float*)b;
  cvec = (vector float*)c;
  for (i = 0; i < VEC_SIZE/4; i++)
      *avec++ = vec_add(*bvec++, *cvec++);

AltiVec Performance Issues

Key to good performance: avoid frequent loads/stores. PETE helps by keeping intermediate results in registers.

System example: DY4 CHAMP-AV board memory hierarchy, G4 processor (400 MHz = 3.2 GFLOPS peak). Measured bandwidth at each level (dividing bytes/s by 4 bytes per float gives the floats/s figures):
– L1 cache (32 KB data): 5.5 GB/s = 1.38 Gfloats/s
– L2 cache (2 MB): 1.1 GB/s = 275 Mfloats/s
– Main memory (64 MB): 112 MB/s = 28 Mfloats/s

Memory is a bottleneck at every level of the hierarchy, and the bottleneck is more pronounced the lower the level.

Outline

– Overview
  – Motivation for using C++
  – Expression Templates and PETE
  – AltiVec
– Combining PETE and AltiVec
– Experiments
– Future Work and Conclusions

PETE: A Closer Look

Example: A = B + C * D

Step 1: Form expression. The user specifies what to store at the leaves; the expression type is the parse tree:

  BinaryNode<OpAdd, Reference<Vector>,
             BinaryNode<OpMultiply, Reference<Vector>, Reference<Vector> > >

Step 2: Evaluate expression. Translation happens at compile time, via template specialization and inlining:

  it = begin();
  for (int i = 0; i < size(); i++) {
      *it = forEach(expr, DereferenceLeaf(), OpCombine());   // *it = *bIt + *cIt * *dIt;
      forEach(expr, IncrementLeaf(), NullCombine());         // bIt++; cIt++; dIt++;
      it++;
  }

PETE ForEach: recursive descent traversal of the expression.
– User defines the action performed at leaves (DereferenceLeaf to read, IncrementLeaf to advance)
– User defines the action performed at internal nodes (OpCombine applies the node's operator, e.g. OpAdd)
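As a rough illustration of how the ForEach traversal collapses into straight-line code, here is a compressed, hypothetical sketch (real PETE separates the leaf action from the combine tag; this version hard-wires OpCombine-style behavior into the node case):

  struct OpAdd      { static float apply(float a, float b) { return a + b; } };
  struct OpMultiply { static float apply(float a, float b) { return a * b; } };

  // Leaves here are bare float pointers.
  template <typename Op, typename L, typename R>
  struct BinaryNode { L left; R right; };

  // Leaf action used by the evaluation traversal.
  struct DereferenceLeaf { float operator()(const float* p) const { return *p; } };

  // forEach, leaf case: apply the user's leaf action.
  template <typename Action>
  float forEach(const float* leaf, Action act) { return act(leaf); }

  // forEach, node case: recurse into the children, then let the node's Op
  // combine the results (the role the OpCombine tag plays in real PETE).
  template <typename Op, typename L, typename R, typename Action>
  float forEach(const BinaryNode<Op, L, R>& n, Action act) {
      return Op::apply(forEach(n.left, act), forEach(n.right, act));
  }

  // For expr of type
  //   BinaryNode<OpAdd, const float*,
  //              BinaryNode<OpMultiply, const float*, const float*> >
  // forEach(expr, DereferenceLeaf()) inlines to exactly the loop body above:
  //   *bIt + (*cIt * *dIt)
  // The pointer-advancing traversal (IncrementLeaf with NullCombine) follows
  // the same recursion.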


PETE: Adding AltiVec

Example: A = B + C * D

Step 1: Form expression. Two changes from the scalar version: the leaves hold vector float* instead of float*, and a multiply-add now produces a trinary node:

  TrinaryNode<OpMulAdd, Reference<Vector>, Reference<Vector>, Reference<Vector> >

Step 2: Evaluate expression. Iterate over 4-float vectors, with new rules for internal nodes that map the trinary node onto the AltiVec multiply-add intrinsic:

  it = (vector float*)begin();
  for (int i = 0; i < size()/4; i++) {
      *it = forEach(expr, DereferenceLeaf(), OpCombine());   // *it = vec_madd(*cIt, *dIt, *bIt);
      forEach(expr, IncrementLeaf(), NullCombine());         // bIt++; cIt++; dIt++;
      it++;
  }

As before, translation happens at compile time via template specialization and inlining; PETE's ForEach recursive descent traversal is unchanged, only the user-supplied leaf and internal-node actions differ.
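A hedged sketch of the two new rules this slide describes, using names like TrinaryNode and OpMulAdd from the slide but with illustrative bodies (requires an AltiVec-enabled compiler, e.g. GCC with -maltivec on PowerPC):

  #include <altivec.h>   // vec_madd and the 'vector float' type

  struct OpMultiply {};
  struct OpMulAdd   {};

  template <typename Op, typename L, typename R>
  struct BinaryNode  { L left; R right; };

  template <typename Op, typename A, typename B, typename C>
  struct TrinaryNode { A addend; B factor1; C factor2; };

  // Tree-building rule: X + (Y * Z) yields one OpMulAdd trinary node rather
  // than nested binary nodes.  (This template is deliberately greedy for
  // brevity; real PETE constrains the operand types.)
  template <typename X, typename Y, typename Z>
  TrinaryNode<OpMulAdd, X, Y, Z>
  operator+(const X& x, const BinaryNode<OpMultiply, Y, Z>& m) {
      return TrinaryNode<OpMulAdd, X, Y, Z>{x, m.left, m.right};
  }

  // Evaluation rule when the leaves are 'vector float*': one fused AltiVec
  // instruction per four floats.  vec_madd(a, b, c) computes a*b + c, hence
  // the argument order vec_madd(*cIt, *dIt, *bIt) for B + C*D.
  inline vector float
  eval(const TrinaryNode<OpMulAdd, vector float*, vector float*, vector float*>& n) {
      return vec_madd(*n.factor1, *n.factor2, *n.addend);
  }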

Outline

– Overview
  – Motivation for using C++
  – Expression Templates and PETE
  – AltiVec
– Combining PETE and AltiVec
– Experiments
– Future Work and Conclusions

Experiments

Software technologies compared:
– AltiVec loop (C): direct use of AltiVec extensions; assumes unit stride; assumes vector alignment
– VSIPL (C): AltiVec-aware VSIPro Core Lite; no multiply-add; cannot assume unit stride; cannot assume vector alignment
– PETE with AltiVec (C++): PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment

Expressions tested: A=B+C; A=B+C*D; A=B+C*D+E*F; A=B+C*D+E/F

Results:
– The hand coded loop achieves good performance, but is problem specific and low level
– Optimized VSIPL performs well for simple expressions, worse for more complex expressions
– PETE-style array operators perform almost as well as the hand coded loop, and are general, composable, and high level

Experimental Platform and Method

Hardware: DY4 CHAMP-AV board, containing 4 MPC7400s and 1 MPC8420. MPC7400 (G4): 450 MHz; 32 KB L1 data cache; 2 MB L2 cache; 64 MB memory per processor.

Software: VxWorks 5.2 real-time OS. GCC (non-official release; GCC with patches for VxWorks). Optimization flags: -O3 -funroll-loops -fstrict-aliasing.

Method: run many iterations and report average, minimum, and maximum times, from 10,000,000 iterations for small data sizes down to 1,000 for large data sizes. All approaches run on the same data; only average times are shown here; only one G4 processor is used. Use of the VxWorks OS resulted in very low variability in timing, giving a high degree of confidence in the results.
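For concreteness, a hypothetical harness in the spirit of the method described (not the actual benchmark code; the clock source and batch sizes are placeholders, and batching keeps each sample well above the clock's resolution):

  #include <algorithm>
  #include <cstdio>
  #include <ctime>

  // Time 'samples' batches of 'reps' runs each and report the average,
  // minimum, and maximum per-run time across batches.
  template <typename Op>
  void timeIt(Op op, long samples, long reps) {
      double minT = 1e30, maxT = 0.0, total = 0.0;
      for (long s = 0; s < samples; ++s) {
          std::clock_t t0 = std::clock();
          for (long r = 0; r < reps; ++r)
              op();                                 // e.g. a = b + c * d
          double t = double(std::clock() - t0) / CLOCKS_PER_SEC / reps;
          minT = std::min(minT, t);
          maxT = std::max(maxT, t);
          total += t;
      }
      std::printf("avg %g s  min %g s  max %g s\n", total / samples, minT, maxT);
  }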

Experiment 1: A=B+C

[Chart: throughput vs. data size for each approach; annotations mark where data overflows the L1 and L2 caches.]

– Peak throughput is similar for all approaches.
– VSIPL has some overhead for small data sizes: VSIPL calls cannot be inlined by the compiler, and VSIPL makes no assumptions about data alignment/stride.

Experiment 2: A=B+C*D

[Chart: throughput vs. data size; annotations mark L1 and L2 cache overflow.]

– The loop and PETE/AltiVec both outperform VSIPL: the VSIPL implementation creates a temporary to hold the multiply result (no multiply-add in Core Lite).
– All approaches have similar performance for very large data sizes.
– PETE/AltiVec adds little overhead compared to the hand coded loop.

Experiment 3: A=B+C*D+E*F

[Chart: throughput vs. data size; annotations mark L1 and L2 cache overflow.]

– The loop and PETE/AltiVec both outperform VSIPL: the VSIPL implementation must create temporaries to hold intermediate results (no multiply-add in Core Lite).
– All approaches have similar performance for very large data sizes.
– PETE/AltiVec has some overhead compared to the hand coded loop.

Experiment 4: A=B+C*D-E/F

[Chart: throughput vs. data size; annotations mark L1 and L2 cache overflow and the region where VSIPL's better divide algorithm helps.]

– The loop and PETE/AltiVec have similar performance; PETE/AltiVec actually outperforms the loop for some sizes.
– Peak throughput is similar for all approaches: the VSIPL implementation must create temporaries to hold intermediate results, but its divide algorithm is probably better.

Outline

– Overview
  – Motivation for using C++
  – Expression Templates and PETE
  – AltiVec
– Combining PETE and AltiVec
– Experiments
– Future Work and Conclusions

Expression Templates and VSIPL++

VSIPL: C binding; object based; procedural.
VSIPL++ (proposal): C++ binding; object oriented; generic.

HPEC-SI (the High Performance Embedded Computing Software Initiative) goals:
– Simplify the interface
– Improve performance

The VSIPL++ implementation can and should use expression templates to achieve these goals.

Conclusions

– Expression templates support a high-level API.
– Expression templates can take advantage of the SIMD AltiVec C/C++ language extensions.
– Expression templates provide the ability to compose complex operations from simple operations without sacrificing performance.
– C libraries cannot compose complex operations while retaining performance: C lacks templates and template specialization, and C library calls cannot be inlined.
– The C++ VSIPL binding (VSIPL++) should allow implementors to take advantage of expression template technology.