An Approach for Enhancing Inter-processor Data Locality on Chip Multiprocessors
Guilin Chen and Mahmut Kandemir
The Pennsylvania State University, USA

Speaker: Zong-Cing Lin

Outline
Introduction
Related Work
Mathematical Theory
Experimental Results
Conclusion
PAS lab@CSIE,NTU

Introduction
In a chip multiprocessor (CMP), several cores access the off-chip memory system and contend for the same buses and pins.
This paper discusses and evaluates a new data reuse framework, customized for embedded CMPs executing loop-intensive stencil applications.
The framework distinguishes between intra-processor and inter-processor data reuse.

Introduction (cont’d)
This paper targets embedded CMPs in which different on-chip processors can share data through an on-chip L2 cache.
Its focus is the optimization of stencil computations.

Related Work
Optimization of stencil computations by customized compilers.
Studies of the memory bandwidth bottleneck in CMPs.

Stencil Computation
A common type of computation in embedded array-based application codes.
In each iteration of a stencil computation, an array element is updated based on the values of its neighboring elements.
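As a concrete instance of the definition above, here is a minimal sketch of one sweep of a 3-point averaging stencil over a 1-D array. The function name `stencil_step`, the averaging rule, and the sample data are illustrative assumptions, not taken from the paper.

```python
def stencil_step(a):
    """One sweep of a 3-point averaging stencil over a 1-D list.

    Boundary elements are copied unchanged. Each interior element becomes
    the mean of itself and its two neighbors, read from the old array.
    """
    n = len(a)
    b = list(a)  # write into a copy so every read sees the old values
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return b


grid = [0.0, 0.0, 3.0, 0.0, 0.0]
result = stencil_step(grid)
print(result)
```

Every interior iteration `i` reads elements `i - 1`, `i`, and `i + 1`, so neighboring iterations read overlapping data. That overlap is what makes stencil codes a natural target for reuse optimizations.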

Data Sharing vs. Data Reuse
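One way to read this distinction, stated here as an assumption rather than as the paper's definition: data sharing means an element is touched by more than one processor, while data reuse means an element is touched more than once. The sketch below uses a hypothetical block partitioning of a 3-point stencil across two processors to show both; the helper `touched_elements` and the iteration ranges are invented for illustration.

```python
def touched_elements(lo, hi):
    """Array indices read by iterations lo..hi-1 of a 3-point stencil."""
    elems = set()
    for i in range(lo, hi):
        elems.update((i - 1, i, i + 1))
    return elems


# Hypothetical partitioning: iterations 1..8 split into two blocks.
p1 = touched_elements(1, 5)  # processor P1 runs i = 1..4
p2 = touched_elements(5, 9)  # processor P2 runs i = 5..8

# Sharing: elements read by both processors (the block boundary).
shared = sorted(p1 & p2)

# Reuse inside P1: element 2 is read by several of P1's own iterations.
p1_readers_of_2 = [i for i in range(1, 5) if 2 in (i - 1, i, i + 1)]

print(shared, p1_readers_of_2)
```

In this toy setup only the two boundary elements are shared, while most interior elements are reused within a single processor.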

Some Mathematical Representations
f : I → A, f(I) = FI + ζ, where F is an n × l matrix and ζ is an n-dimensional constant vector.
Linear loop transformations can be used to optimize a loop nest.
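The mapping f(I) = FI + ζ can be evaluated directly. The sketch below assumes the usual reading in which I is the iteration vector of an l-deep loop nest and f(I) is the index of the array element accessed. The reference `A[i-1][j+1]`, the helper names, and the choice of loop interchange as the linear transformation are illustrative assumptions.

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def access(F, zeta, I):
    """Evaluate f(I) = F I + zeta."""
    return [x + z for x, z in zip(mat_vec(F, I), zeta)]


# Reference A[i-1][j+1] in a loop nest (i, j): n = 2, l = 2.
F = [[1, 0], [0, 1]]
zeta = [-1, 1]
elem = access(F, zeta, [3, 4])

# Loop interchange as a linear transformation: I' = T I.
T = [[0, 1], [1, 0]]
T_inv = T  # this particular T is its own inverse
F_new = mat_mul(F, T_inv)    # access matrix in the transformed nest
I_new = mat_vec(T, [3, 4])   # the same iteration, renamed
elem_new = access(F_new, zeta, I_new)

print(elem, I_new, elem_new)
```

The transformed nest visits iterations in a different order, but each iteration still touches the same element, because F T⁻¹ (T I) = F I.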

The Algorithm for Solving a Set of Equations
For V and W processor pairs that share data, the complexity of this algorithm is WY

Two Important Lemmas
Lemma 1: If processor P2 exhibits self-reuse after loop transformation T2, then processor P1 also exhibits self-reuse after loop transformation T1.
This keeps the original intra-processor data self-reuse pattern.
Lemma 2: If the last column of F has only one non-zero entry and processor P2 preserves group-reuse after loop transformation T2, then processor P1 also preserves group-reuse after loop transformation T1.
This keeps the original intra-processor data group-reuse pattern in most cases.
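The lemmas rely on the notions of self-reuse and group-reuse. The sketch below uses the standard textbook formulation, which is assumed here and not quoted from the paper: a reference with access matrix F has self-reuse along a non-zero vector r when F r = 0, and two references with the same F and offsets ζ1 and ζ2 have group-reuse along r when F r = ζ1 − ζ2. The brute-force search, the `bound` parameter, and the example references are illustrative.

```python
from itertools import product


def reuse_vectors(F, delta, bound=2):
    """Return every integer vector r in [-bound, bound]^l with F r == delta."""
    depth = len(F[0])
    found = []
    for r in product(range(-bound, bound + 1), repeat=depth):
        if [sum(f * x for f, x in zip(row, r)) for row in F] == list(delta):
            found.append(r)
    return found


# Self-reuse: A[i] inside an (i, j) nest, so F = [1 0].
F_self = [[1, 0]]
self_reuse = [r for r in reuse_vectors(F_self, [0]) if any(r)]

# Group-reuse: A[i][j] and A[i][j-1] in an (i, j) nest, so F is the
# identity (its last column has exactly one non-zero entry).
F_group = [[1, 0], [0, 1]]
zeta1, zeta2 = [0, 0], [0, -1]
delta = [a - b for a, b in zip(zeta1, zeta2)]
group_reuse = reuse_vectors(F_group, delta)

print(self_reuse, group_reuse)
```

Here `A[i]` is reused across the `j` loop, and `A[i][j-1]` reads at iteration (i, j + 1) the element that `A[i][j]` read at iteration (i, j).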

Experimental Environment
Simulation: Simics tool-set
Private L1 cache
Shared L2 cache

Stencil Applications Used in Experiments

Stencil Applications Used in Experiments (cont’d)

Savings in the Off-chip Memory References and Execution Cycles

Reduction in Execution Cycles with Different Processor Counts

Conclusion
Minimizing the number of off-chip memory references is very important in embedded chip multiprocessors, from both a performance perspective and a power perspective.
This paper proposes and evaluates a compiler-based solution for stencil computations: it reorganizes the loop iterations assigned to processors in a coordinated fashion so that the reuse distance to shared data is minimized.
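To illustrate why coordinating iteration order matters, the sketch below compares two schedules for a 3-point stencil split across two processors. It makes several assumptions that are not from the paper: the processors run in lockstep, the gap between first touches of a shared element stands in as a crude proxy for reuse distance, and reversing P1's loop is just one hypothetical reordering, not the paper's transformation.

```python
def first_touch_times(order):
    """Map each array element to the first step at which it is read,
    given the order in which a processor runs its stencil iterations."""
    times = {}
    for step, i in enumerate(order):
        for elem in (i - 1, i, i + 1):
            times.setdefault(elem, step)
    return times


def max_shared_gap(order1, order2):
    """Largest gap, in steps, between the two processors' first touches
    of any element that both of them read."""
    t1 = first_touch_times(order1)
    t2 = first_touch_times(order2)
    shared = t1.keys() & t2.keys()
    return max(abs(t1[e] - t2[e]) for e in shared)


p1 = [1, 2, 3, 4]  # P1's iterations in the default ascending order
p2 = [5, 6, 7, 8]  # P2's iterations

gap_default = max_shared_gap(p1, p2)
gap_coordinated = max_shared_gap(p1[::-1], p2)  # P1 starts at the boundary

print(gap_default, gap_coordinated)
```

With the default order, P1 reaches the shared boundary elements only at the end of its block, long after P2 has read them. With the reversed order both processors read them in the same step, so a shared cache is more likely to still hold them.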

Any Questions?
