Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors Qingda Lu1, Christophe Alias2, Uday Bondhugula1, Thomas Henretty1, et al.


Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors Qingda Lu1, Christophe Alias2, Uday Bondhugula1, Thomas Henretty1, Sriram Krishnamoorthy3, J. Ramanujam4, Atanas Rountev1, P. Sadayappan1, Yongjian Chen5, Haibo Lin6, Tin-Fook Ngai5 1The Ohio State University, 2ENS Lyon – INRIA, 3Pacific Northwest National Lab, 4Louisiana State University, 5Intel Corp., 6IBM China Research Lab 6/2/2019

The Hardware Trends
Trends in fabrication technology:
- Gate delay decreases
- Wire delay increases
Trends in chip multiprocessors (CMPs):
- More processors
- Larger caches: private L1 caches, and a shared last-level cache to reduce off-chip accesses
We focus on the L2 cache.

Non-Uniform Cache Architecture
The problem: large caches have long latencies.
The solution: employ a banked L2 cache organization with non-uniform latencies.
NUCA designs:
- Static NUCA: the mapping of data into banks is predetermined by the block index.
- Dynamic NUCA: data migrates among banks; hard to implement due to overheads and design challenges.
[Figure: a physical address is split into fields, including a bank number (log2 N bits) and a block offset (log2 L bits).]

A Tiled NUCA-CMP
[Figure: a 4x4 grid of tiles T0-T15; each tile contains a processor, a private L1 cache, an L2 cache bank, a router, and a directory.]

Motivating Example: Parallelizing 1-D Jacobi on a 4-tile CMP
1-D Jacobi code:
while (condition) {
  for (i = 1; i < N-1; i++)
    B[i] = A[i-1] + A[i] + A[i+1];
  for (i = 1; i < N-1; i++)
    A[i] = B[i];
}
If we use the standard linear array layout and parallelize the program with the "owner-computes" rule, the communication volume is roughly 2N cache lines in every outer iteration.
[Figure: with the linear layout, consecutive cache lines of A[0..31] and B[0..31] are interleaved round-robin across banks 0-3, so the elements each thread computes are scattered over all four banks.]

Motivating Example: Parallelizing 1-D Jacobi on a 4-tile CMP (cont.)
If we instead divide the iteration space into four contiguous partitions and rearrange the data layout accordingly, only 6 remote cache lines are requested in every outer iteration.
[Figure: after the transformation, A[0..7] and B[0..7] reside in bank 0, A[8..15] and B[8..15] in bank 1, A[16..23] and B[16..23] in bank 2, and A[24..31] and B[24..31] in bank 3; only the partition-boundary lines are remote.]

The Approach
A general framework that integrates data layout transformation and loop transformations to enhance data locality on NUCA CMPs.
Formalism: the polyhedral model.
Two main steps:
- Localization analysis: search for an affine mapping of iteration spaces and data spaces such that no iteration accesses any data beyond a bounded distance in the target space.
- Code generation: generate efficient indexing code using CLooG, differentiating the "bulk" scenario from the "boundary" case via linear constraints.

Polyhedral Model
An algebraic framework for representing affine programs – statement domains, dependences, array access functions – and affine program transformations.
Regular affine programs:
- Dense arrays
- Loop bounds: affine functions of outer loop variables, constants, and program parameters
- Array access functions: affine functions of surrounding loop variables, constants, and program parameters

Polyhedral Model: Example

for (i=1; i<=7; i++)
  for (j=2; j<=6; j++)
S1:   a[i][j] = a[j][i] + a[i][j-1];

Iteration vector: x_S1 = (i, j).

Statement domain D_S1 = { (i, j) : i >= 1, i <= 7, j >= 2, j <= 6 }, in matrix form:

  [  1  0 ]             [ -1 ]
  [ -1  0 ] . (i, j)T + [  7 ]  >= 0
  [  0  1 ]             [ -2 ]
  [  0 -1 ]             [  6 ]

Access function of the reference a[i][j-1]:

  F_S1,a(x_S1) = [ 1 0 ] . (i, j)T + [  0 ]  = (i, j-1)T
                 [ 0 1 ]             [ -1 ]

Computation Allocation and Data Mapping
Computation allocation: maps the original iteration space to an integer vector representing the virtual processor space. A one-dimensional affine transform π for a statement S is given by
  π_S(x_S) = C_S . (x_S, n, 1)T
where n is the vector of program parameters and C_S is a constant row vector.
Data mapping: maps the original data space to an integer vector representing the virtual processor space. A one-dimensional affine transform ψ for an array A is given by
  ψ_A(x_A) = G_A . (x_A, n, 1)T
where G_A is a constant row vector.

Localized Computation Allocation and Data Mapping
Definition: For a program P with index set D, a computation allocation π and a data mapping ψ for P are localized if and only if for every array A and every reference F_S,A in a statement S,
  | ψ_A(F_S,A(x_S)) - π_S(x_S) | <= q  for all x_S in D,
where q is a constant.
As a special case, communication-free localization can be achieved if and only if for every array A and every array reference F_S,A in a statement S, the computation allocation π and data mapping ψ satisfy
  ψ_A(F_S,A(x_S)) = π_S(x_S)  for all x_S in D.

Localization Analysis
Step 1: Group interrelated statements and arrays.
- Form a bipartite graph: vertices correspond to statements and arrays, and edges connect each statement vertex to the arrays it references.
- Find the connected components of the bipartite graph.
Step 2: Find a localized computation allocation / data mapping for each connected component.
- Formulate the problem as finding an affine computation allocation π and an affine data mapping ψ that satisfy | ψ_A(F_S,A(x_S)) - π_S(x_S) | <= q for every array reference F_S,A.
- Translate the problem into a linear programming problem that minimizes q.

Localization Analysis Algorithm
Require: array access functions are rewritten to access byte arrays
  C = ∅
  for each array reference F_S,A do
    Obtain the new constraints | ψ_A(F_S,A(x_S)) - π_S(x_S) | <= q under x_S in D_S
    Apply the Farkas Lemma to the new constraints to obtain linear constraints; eliminate all Farkas multipliers
    Add the linear inequalities from the previous step into C
  Add the objective function (min q)
  Solve the resulting linear programming problem with the constraints in C
  if ψ and π are found then return π, ψ, and q
  else return "not localizable"

Data Layout Transformation
Strip-mining:
- An array dimension of size N becomes two virtual dimensions of sizes N/d and d.
- Array reference [...][i][...] becomes [...][i / d][i mod d][...].
Permutation:
- Array A(..., N1, ..., N2, ...) becomes A'(..., N2, ..., N1, ...).
- Array reference A[...][i1][...][i2][...] becomes A'[...][i2][...][i1][...].

Data Layout Transformation (cont.)
The 1-D Jacobi example: a combination of strip-mining and permutation.
[Figure: strip-mining A and B by the cache-line size and permuting the resulting dimensions converts the round-robin interleaving of cache lines across banks 0-3 into a blocked layout where A[0..7] and B[0..7] fall in bank 0, A[8..15] and B[8..15] in bank 1, and so on.]

Data Layout Transformation (cont.)
Padding:
- Inter-array padding: keep the base addresses of arrays aligned to the tile specified by the data mapping function.
- Intra-array padding: insert "holes" between elements inside an array to make a strip-mined dimension divisible by its sub-dimensions.

Data Layout Transformation (cont.)
With a localized computation allocation and data mapping:
- If we find a communication-free allocation in which all arrays share the same stride along the fastest-varying dimension, only padding is applied.
- Otherwise we create a blocked view of the n-dimensional array A along dimension k with data layout transformations:
  - If k = 1, the layout transformation is σ(i_n, i_{n-1}, ..., i_2, i_1) = (i_n, i_{n-1}, ..., i_2, (i_1 mod (N_1/P)) / L, i_1 / (N_1/P), i_1 mod L), where L is the cache line size in elements and P is the number of processors.
  - If k > 1, the layout transformation is σ(i_n, i_{n-1}, ..., i_2, i_1) = (i_n, ..., i_1 / L, i_k mod (N_k/P), ..., i_2, i_k / (N_k/P), i_1 mod L).

Code Generation
Generating efficient code is challenging: naively replacing each array reference A[u(i)] with A'[σ(u(i))] produces very inefficient code.
Instead, we iterate directly in the data space of the transformed array. In the polyhedral model, we specify an iteration domain D and an affine scheduling function for each statement. Efficient code is then generated by CLooG, which implements the Quilleré-Rajopadhye-Wilde algorithm.

Code Generation (cont.)
Key techniques used in generating code that accesses layout-transformed arrays:
- While σ is not affine, σ^-1 is. We replace the constraints j_k = σ(u_k(i)) by σ^-1(j_k) = u_k(i).
- We separate the domain D into two sub-domains, D_steady and D_boundary, and generate code for each.
- D_steady covers the majority of D by simplifying constraints such as (i - u) mod L to (i mod L) - u.
- Boundary cases are automatically handled in D_boundary = D - D_steady.

Experimental Setup
Using the Virtutech Simics full-system simulator extended with the GEMS timing infrastructure, we simulated a tiled NUCA-CMP. We evaluated our approach with a set of data-parallel benchmarks; the focus of the approach is not on finding parallelism.

Processor:        16 4-way, 4 GHz in-order SPARC cores
L1 cache:         private, 64 KB I/D cache, 4-way, 64-byte line, 2-cycle access latency
L2 cache:         shared, 8 MB unified cache, 8-way, 64-byte line, 8-cycle access latency
Memory:           4 GB, 320-cycle (80 ns) latency, 8 controllers
On-chip network:  4x4 mesh, 5-cycle per-link latency, 16 GB bandwidth per link

Data Locality Optimization for NUCA-CMPs: The Results
[Figure: Speedup]
[Figure: Link Utilization]
[Figure: Remote L2 Access Number]

Conclusion
We believe that the approach developed in this study provides a very general treatment of:
- code generation for data layout transformation
- analysis of data access affinities
- automated generation of efficient code for the transformed layout