Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements
Gautam Chakrabarti and Fred Chow, PathScale, LLC.

Outline
- Motivation
- Types of structure layout optimizations
- Criteria for structure layout optimizations
- Implementation details
- Performance results
- Future work
- Conclusion

Motivation
- Poor data locality in many applications
- High data cache miss rates
- Growing gap between processor and memory speeds

Our Approach
- Change layout of data structures
- Requires whole-program optimization
- Use Inter-Procedural Analysis and Optimizations (IPA)

Our Aim
- Make applications more cache-friendly

IPA
- Summarization
- Analysis
- Optimization

Types of Structure Layout Optimizations
- Structure splitting
- Structure peeling

Candidate for structure splitting:

    struct struct_A {
      double d1;
      double d2;
      int i;
      float f;
      long long l;
      char c;
      struct struct_A *next;
    };

Candidate for structure peeling:

    struct struct_A {
      double d1;
      double d2;
      int i;
      float f;
      long long l;
      char c;
    };

Structure Splitting Example

Original:

    struct struct_A {
      double d1;
      double d2;
      int i;
      float f;
      long long l;
      char c;
      struct struct_A *next;
    };

After splitting:

    struct new_struct_A {
      double d1;
      int i;
      long long l;
      struct new_struct_A *next;
      struct cold_sub_struct_A *p;
    };

    struct cold_sub_struct_A {
      double d2;
      float f;
      char c;
    };

Structure Peeling Example

Original:

    struct struct_A {
      double d1;
      double d2;
      int i;
      float f;
      long long l;
      char c;
    };

After peeling:

    struct new_struct_A {
      double d1;
      int i;
      long long l;
    };

    struct cold_sub_struct_A {
      double d2;
      float f;
      char c;
    };

Criteria for Structure Layout Optimizations
- Legality analysis (disqualifying patterns illustrated in the sketch below)
  - Type casts
  - Address of a field is taken
  - Escaped types
  - Parameter types
  - Full visibility to IPA
  - Alignment restrictions
- Profitability analysis
  - Hotness
  - Affinity
  - Field accesses at loop level
  - Size
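
As a rough illustration (not from the slides; the struct, field, and function names below are invented), these are the kinds of C source patterns the legality analysis has to catch, since each one pins down or leaks the structure's layout:

    #include <stdlib.h>
    #include <string.h>

    struct A { double d; int i; char c; };

    void external_fn(struct A *);   /* definition not visible to IPA */

    void legality_examples(struct A *a, int n)
    {
        /* Type cast: reinterpreting the structure's memory assumes a
           fixed layout, so the type cannot be reorganized. */
        char *raw = (char *) a;
        memset(raw, 0, n * sizeof(struct A));

        /* Address of a field is taken: the field's offset may escape. */
        int *pi = &a->i;
        *pi = 42;

        /* Escaped / parameter type: 'a' is passed to code that IPA
           cannot see, so the layout must stay as declared. */
        external_fn(a);
    }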

Implementation Details
- Step 1: Type information summarization (IPL)
- Step 2: Symbol table merging (IPA)
- Step 3: Legality and profitability analysis (IPA analysis)
- Step 4: Transforming the program (IPA optimization)

Implementation Details: Type Information Summarization
- Information summarization in IPL
- Framework for computing static profiles using heuristics
- New TY flag: TY_NO_SPLIT
- SUMMARY_TY_INFO
- SUMMARY_LOOP
  - For each DO_LOOP, WHILE_DO, DO_WHILE
  - Bit-vector to track field accesses of up to N structures for each loop
    - Considers field accesses immediately inside the loop
    - These fields are considered affine to each other
  - Execution count of statements immediately inside the loop
    - From statically estimated profiles or from runtime feedback
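
A minimal sketch of the bit-vector idea, assuming a simplified per-loop record; this is illustrative C, not the actual IPL data structures, and names such as loop_summary and MAX_TRACKED_STRUCTS are invented:

    #include <stdint.h>

    #define MAX_TRACKED_STRUCTS 4     /* "up to N structures" per loop */

    /* Hypothetical stand-in for a SUMMARY_LOOP-style record. */
    struct loop_summary {
        uint64_t field_access_bv[MAX_TRACKED_STRUCTS]; /* bit j set => field j accessed */
        uint64_t exec_count;  /* statically estimated or from runtime feedback */
    };

    /* Called for every field access found immediately inside the loop. */
    static void record_field_access(struct loop_summary *ls,
                                    int struct_id, int field_id)
    {
        if (struct_id < MAX_TRACKED_STRUCTS && field_id < 64)
            ls->field_access_bv[struct_id] |= (uint64_t)1 << field_id;
    }

    /* Fields whose bits end up set in the same loop's bit-vector are
       treated as affine to each other, weighted by exec_count. */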

Implementation Details: IPA Analysis
- Inter-procedurally update statically estimated execution count of PUs
- Update statically estimated loop frequencies in SUMMARY_LOOP
- Consider SUMMARY_LOOP from the hottest P PUs
- Determine candidates for structure-layout transformation
- Determine new layout of structures

Implementation Details: IPA Analysis Example

[Slide figure: a table of per-loop bit-vectors (BV) recording which fields F1-F4 each of the loops L1-L5 accesses, and the affinity groups derived from them, AG1, AG2 and AG3, with associated values 40, 14 and 88. Legend: Li = loops; Fj = fields in a struct; AGk = affinity groups.]
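
A rough sketch of how such affinity groups could be formed from the per-loop bit-vectors and execution counts (illustrative C only; the bit-vector values and the greedy grouping below are invented and simpler than the compiler's actual heuristics):

    #include <stdint.h>

    #define NUM_LOOPS 5

    /* Hypothetical per-loop data standing in for L1..L5 over fields F1..F4:
       bit j of loop_bv[l] is set if loop l accesses field j. */
    static const uint32_t loop_bv[NUM_LOOPS]    = { 0x3, 0x3, 0xC, 0xC, 0x3 };
    static const uint64_t loop_count[NUM_LOOPS] = { 20, 14, 60, 28, 6 };

    struct affinity_group { uint32_t fields; uint64_t hotness; };

    /* Greedy grouping: loops whose bit-vectors match contribute to the
       same affinity group; its hotness accumulates their execution counts. */
    static int build_affinity_groups(struct affinity_group *ag, int max_groups)
    {
        int ngroups = 0;
        for (int l = 0; l < NUM_LOOPS; l++) {
            int g;
            for (g = 0; g < ngroups; g++)
                if (ag[g].fields == loop_bv[l])
                    break;
            if (g == ngroups) {
                if (ngroups == max_groups)
                    continue;               /* out of room: ignore this loop */
                ag[ngroups].fields  = loop_bv[l];
                ag[ngroups].hotness = 0;
                ngroups++;
            }
            ag[g].hotness += loop_count[l];
        }
        return ngroups;
    }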

Implementation Details: Transforming the Program
- New type definitions
- Field table update
- Field access statements
- New symbols
- Assignment statements

Example (peeling struct T):

    struct S {
      // N fields
      struct T *p;
      // M fields
    };

    struct T {
      // AG1 fields
      // AG2 fields
    };

After peeling T:

    struct S {
      // N fields
      struct T1 *p1;
      struct T2 *p2;
      // M fields
    };

    struct T1 {
      // AG1 fields
    };

    struct T2 {
      // AG2 fields
    };
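
In source terms, rewriting the field access statements after peeling looks roughly like the following C sketch (the real transformation happens on the compiler's intermediate representation; the types and functions here are invented):

    /* Before peeling: hot and cold fields share each struct T object. */
    struct T  { double hot; double cold; };

    double sum_hot_before(struct T *t, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += t[i].hot;          /* drags the cold field into the cache too */
        return s;
    }

    /* After peeling: each access is redirected to the peeled type that
       now owns the field, so only hot data is brought into the cache. */
    struct T1 { double hot; };
    struct T2 { double cold; };

    double sum_hot_after(struct T1 *t1, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += t1[i].hot;
        return s;
    }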

Implementation Details: Transforming the Program (continued)

Function calls to memory management routines. Example:

    p = (T *) malloc (N * sizeof (T));
    if (p == NULL) exit (1);

- Detect memory management routine calls involving the transformed type T
- Replicate the call and assignment statements
- Update the size of memory being allocated
- Handle comparisons involving the pointer p
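
A source-level sketch of that allocation rewrite for a type peeled into hypothetical T1 and T2 parts (illustrative only; the compiler performs the equivalent rewrite on its intermediate representation):

    #include <stdlib.h>

    struct T1 { double d1; int i; };   /* hot fields  */
    struct T2 { float f; char c; };    /* cold fields */

    /* Before peeling (conceptually):
     *     p = (T *) malloc (N * sizeof (T));
     *     if (p == NULL) exit (1);
     *
     * After peeling, the call and its assignment are replicated once per
     * peeled type, each with its own element size, and the NULL comparison
     * is rewritten to cover both resulting pointers. */
    void allocate_peeled(size_t N, struct T1 **p1_out, struct T2 **p2_out)
    {
        struct T1 *p1 = (struct T1 *) malloc(N * sizeof(struct T1));
        struct T2 *p2 = (struct T2 *) malloc(N * sizeof(struct T2));
        if (p1 == NULL || p2 == NULL)
            exit(1);
        *p1_out = p1;
        *p2_out = p2;
    }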

Performance Results
Compilation options: -Ofast, 32-bit ABI
Speedup due to structure layout optimizations

Platforms: AMD Opteron™ (2.8GHz, 4GB, 1MB), AMD Barcelona (2.0GHz, 8GB, 512KB), Intel® EM64T (3.4GHz, 4GB, 1MB), Intel® Core™ (3.0GHz, 4GB, 4MB), SiCortex MIPS® (500MHz, 4GB, 256KB)

Benchmark       | Opteron | Barcelona | EM64T | Core  | MIPS  | Geometric Mean
179.art         | 134%    | 66%       | 56%   | 47%   | 41%   | 62.5%
181.mcf         | 24%     | 23%       | 23%   | 31%   | 13%   | 22.0%
462.libquantum  | 32%     | 17%       | 40%   | 72%   | 62%   | 39.6%
Geometric Mean  | 46.9%   | 29.6%     | 37.2% | 47.2% | 32.1% | 37.9%

Performance Results (continued)
Compilation options: -Ofast, 64-bit ABI
Speedup due to structure layout optimizations

Platforms: AMD Opteron™ (2.8GHz, 4GB, 1MB), AMD Barcelona (2.0GHz, 8GB, 512KB), Intel® EM64T (3.4GHz, 4GB, 1MB), Intel® Core™ (3.0GHz, 4GB, 4MB), SiCortex MIPS® (500MHz, 4GB, 256KB)

Benchmark       | Opteron | Barcelona | EM64T | Core  | MIPS  | Geometric Mean
179.art         | 169%    | 66%       | 53%   | 60%   | 45%   | 69.3%
181.mcf         | 25%     | 35%       | 12%   | 30%   | 7%    | 18.6%
462.libquantum  | 82%     | 51%       | 75%   | 70%   | 69%   | 68.6%
Geometric Mean  | 70.2%   | 49.0%     | 36.3% | 50.1% | 27.9% | 44.6%

Performance Results (continued)
Compilation options: -Ofast, 64-bit ABI
Multiple copies of 462.libquantum running on a multi-core chip
Platform: quad-core AMD Barcelona (2.0GHz, 8GB, 512KB, 2MB); the 3rd-level cache is shared among the 4 cores

Speedup from structure layout optimizations

Benchmark       | 1 copy | 2 copies | 4 copies
462.libquantum  | 51%    | 69%      | 123%

Future Work
- Tune static profile estimation
- Fewer restrictions
- Integrate with field reordering

Conclusion
- A framework for performing structure layout transformations is now available in the Open64 compiler.
- The superior infrastructure of the Open64 compiler helped us implement the optimizations cleanly and with relatively little effort.
- Substantial speedups are possible on some of the SPEC CPU2000 and CPU2006 benchmarks.
- Structure layout optimization is a required feature for a compiler to remain competitive.