Extensions to Structure Layout Optimizations in the Open64 Compiler
Michael Lai, AMD

Related Work
Structure splitting, structure peeling, structure field reordering (Hagog & Tice; Hundt, Mannarswamy & Chakrabarti)
The above are implemented in the Open64 Compiler (Chakrabarti & Chow)
Structure instance interleaving (Truong, Bodin & Seznec)
Data splitting (Curial, Zhao & Amaral)
Array reshaping (Zhao, Cui, Gao, Silvera & Amaral)

Current Framework (compilation flow)
source → frontend → WHIRL → ipl → WHIRL .o → ipa_link

Instance Interleaving
Original layout: a[0].field_1, a[0].field_2, a[0].field_3 form an "instance" of the structure; a[1].field_1, a[1].field_2, a[1].field_3 form another "instance" of the structure; and so on.

Instance Interleaving
Interleaved layout: field_1 of all the instances (a[0].field_1, a[1].field_1, …) are interleaved together; field_2 of all the instances (a[0].field_2, a[1].field_2, …) are interleaved together; field_3 of all the instances (a[0].field_3, a[1].field_3, …) are interleaved together; and so on.

Instance Interleaving
Before: array[0].field_1 … array[0].field_m, array[1].field_1 … array[1].field_m, …, array[n-1].field_1 … array[n-1].field_m
After: array[0].field_1, array[1].field_1, …, array[n-1].field_1; then array[0].field_2, array[1].field_2, …, array[n-1].field_2; …; then array[0].field_m, array[1].field_m, …, array[n-1].field_m
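
To make the layout difference concrete, here is a minimal C sketch (hypothetical structure, field names, and functions; none of this is code from the compiler or from the benchmarks). A loop that reads only one field drags every instance's unused fields through the cache under the original layout, but streams through one dense block once the fields are interleaved:

    #include <stddef.h>

    struct rec { int field_1; double field_2; int field_3; };

    /* Original layout: array of structures.  Reading only field_2 still
       pulls field_1 and field_3 of every instance into the cache. */
    double sum_original(const struct rec *array, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += array[i].field_2;
        return s;
    }

    /* Interleaved layout: field_2 of all instances is contiguous, so the
       same loop reads a dense, unit-stride stream of doubles. */
    double sum_interleaved(const double *field_2_block, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += field_2_block[i];
        return s;
    }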

Implementation
Profitability analysis (done in ipl)
– During ipl compilation of each source file, access patterns of structure fields are analyzed and their usage statistics recorded
– After all the functions have been compiled by ipl, the "most likely to benefit" structure (if any) is marked and passed to ipo
– (By way of illustration, the ideal structure is one with many fields, each of which appears in its own hot loop; see the sketch below)
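
For illustration only, a hypothetical structure and loop nest (invented names, not the heuristic's actual input) of the kind the profitability analysis favors: every field has its own hot loop, so after interleaving each loop becomes a unit-stride scan of one dense field block.

    struct particle { double x; double y; double z; };   /* "many fields" in practice */

    void scale(struct particle *p, long n)
    {
        for (long i = 0; i < n; i++) p[i].x *= 2.0;   /* hot loop touching only x */
        for (long i = 0; i < n; i++) p[i].y *= 3.0;   /* hot loop touching only y */
        for (long i = 0; i < n; i++) p[i].z *= 4.0;   /* hot loop touching only z */
    }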

Implementation
Legality analysis (done in ipo)
– Usual checking for address taken, escaped types, etc.
Code transformation (done in ipo)
– Create internal pointers ptr_1, ptr_2, …, ptr_m to keep track of the m locations array[0].field_1, array[0].field_2, …, array[0].field_m
– Rewrite array[i].field_j to ptr_j[i] if "i" is known; otherwise, incur additional overhead to compute "i" (see the sketch below)
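
A rough C-level picture of the transformation (the rewrite actually happens on WHIRL inside ipo; the names ptr_1, ptr_2, and alloc_interleaved are invented for this sketch):

    #include <stdlib.h>

    struct rec { int field_1; double field_2; };   /* a structure with m = 2 fields */

    /* Internal pointers to the m interleaved field blocks; ptr_j tracks the
       location that used to be array[0].field_j. */
    static int    *ptr_1;
    static double *ptr_2;

    /* The original allocation of n instances becomes one dense block per field. */
    void alloc_interleaved(size_t n)
    {
        ptr_1 = malloc(n * sizeof(*ptr_1));
        ptr_2 = malloc(n * sizeof(*ptr_2));
    }

    /* array[i].field_j is rewritten to ptr_j[i] when the index i is known. */
    double get_field_2(size_t i)
    {
        return ptr_2[i];
    }

    /* When only a pointer into the old layout is available, extra code must
       first recover i (for example from the pointer minus the array base)
       before indexing ptr_j[i]; this is the additional overhead noted above. */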

Instance Interleaving
After interleaving, ptr_1 points to the block array[0].field_1, array[1].field_1, …, array[n-1].field_1; ptr_2 points to the block array[0].field_2, array[1].field_2, …, array[n-1].field_2; …; ptr_m points to the block array[0].field_m, array[1].field_m, …, array[n-1].field_m
array[i].field_j becomes ptr_j[i]

Array Remapping
Before: a[0] … a[m-1] hold field_1 … field_m used in iteration 0; a[m] … a[2m-1] hold field_1 … field_m used in iteration 1; …; a[(n-1)m] … a[nm-1] hold field_1 … field_m used in iteration n-1
After: a[0] … a[n-1] hold field_1 for iterations 0 … n-1; a[n] … a[2n-1] hold field_2 for iterations 0 … n-1; …; a[(m-1)n] … a[mn-1] hold field_m for iterations 0 … n-1
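
A small worked example, with concrete numbers added for illustration: with m = 3 fields and n = 4 iterations, the original order a[0], a[1], a[2] (iteration 0's field_1, field_2, field_3), a[3], a[4], a[5] (iteration 1's fields), and so on becomes a[0] … a[3] holding field_1 for iterations 0 … 3, a[4] … a[7] holding field_2, and a[8] … a[11] holding field_3. The element that was a[5] (iteration 1's field_3) therefore ends up at position 9.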

Implementation
Profitability analysis (done in ipl)
– During ipl compilation of each source file, discover whether there are arrays that both behave like structures and suffer poor data cache utilization (see the sketch below)
– After all the functions have been compiled by ipl, the "most likely to benefit" arrays (if any) are marked and passed to ipo
– For each of these arrays, record the stride, group size, and array size associated with it
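
As a hypothetical C illustration (invented names, not taken from 470.lbm), an array "behaves like a structure" when each logical record occupies a group of m consecutive elements, so a loop that touches only one "field" per iteration strides through memory with stride m and wastes most of each cache line:

    /* n logical records; record i occupies the group a[i*m] .. a[i*m + m-1],
       whose m elements play the role of m structure fields. */
    double sweep_field_0(const double *a, long n, long m)
    {
        double s = 0.0;
        /* Only "field 0" of each record is read: stride-m accesses with poor
           data cache utilization.  The profitability analysis records this
           stride, the group size m, and the array size n * m. */
        for (long i = 0; i < n; i++)
            s += a[i * m];
        return s;
    }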

Implementation
Legality analysis (done in ipo)
– Check for array aliasing, address taken, argument passing, etc.
Code transformation (done in ipo)
– Construct the array remapping permutation alpha(i) = (i % m) * n + (i / m), where m is the group size and n is the number of such groups
– Rewrite a[i] to a[alpha(i)] (see the sketch below)
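
A minimal C sketch of the permutation (illustrative only; the compiler rewrites the array subscripts in its intermediate representation rather than emitting a helper function like this alpha):

    /* m = group size (elements per logical record),
       n = number of such groups (records). */
    static long alpha(long i, long m, long n)
    {
        return (i % m) * n + (i / m);
    }

    /* Every original access a[i] is rewritten to a[alpha(i)]. */
    double read_remapped(const double *a, long i, long m, long n)
    {
        return a[alpha(i, m, n)];
    }

For the element that was record r's field j (i = r*m + j), alpha(i) = j*n + r, so field j's values for all n records end up in the contiguous block a[j*n] … a[j*n + n-1], exactly the remapped layout shown on the Array Remapping slides.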

Array Remapping
Before: a[0] … a[m-1] (iteration 0), a[m] … a[2m-1] (iteration 1), …, a[(n-1)m] … a[nm-1] (iteration n-1), each group holding field_1 … field_m
After: a[0] … a[n-1] hold field_1, a[n] … a[2n-1] hold field_2, …, a[(m-1)n] … a[mn-1] hold field_m, each block covering iterations 0 … n-1
a[i] becomes a[(i % m) * n + (i / m)]

Performance Results

AMD system                            speed (1-copy) run      rate (12-copy) run
462.libquantum (structure peeling)    +6.35%                  +43.43%
429.mcf (instance interleaving)       +2.43%                  +38.38%
470.lbm (array remapping)             -16.35% (degradation)

Intel system                          speed (1-copy) run      rate (4-copy) run
462.libquantum (structure peeling)    +7.01%                  +24.30%
429.mcf (instance interleaving)       -6.04% (degradation)    +34.62%
470.lbm (array remapping)             -23.28% (degradation)

Future Work
Integrate existing structure layout optimizations with the new structure instance interleaving work
Combine profitability heuristics of all structure layout optimizations
Extend structure instance interleaving optimization to more than one structure
Extend array remapping optimization to multi-dimensional arrays

References
1. G. Chakrabarti and F. Chow. "Structure Layout Optimizations in the Open64 Compiler." Proceedings of the Open64 Workshop, Boston, 2008.
2. M. Hagog and C. Tice. "Cache Aware Data Layout Reorganization Optimization in gcc." Proceedings of the gcc Developers Summit, 2005.
3. R. Hundt, S. Mannarswamy, and D.R. Chakrabarti. "Practical Structure Layout Optimization and Advice." Proceedings of the International Symposium on Code Generation and Optimization, New York, 2006.
4. D.N. Truong, F. Bodin, and A. Seznec. "Improving Cache Behavior of Dynamically Allocated Data Structures." Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Washington D.C., 1998.