Published by Augustine Carr. Modified over 8 years ago.
1
An Approach for Enhancing Inter-processor Data Locality on Chip Multiprocessors
Guilin Chen and Mahmut Kandemir
The Pennsylvania State University, USA
Speaker: Zong-Cing Lin
2
Outline
- Introduction
- Related Work
- Mathematical Theory
- Experimental Results
- Conclusion
PAS lab@CSIE, NTU
3
Introduction
- In chip multiprocessors (CMPs), several cores need to access the off-chip memory system.
- These cores contend for the same buses and pins.
- This paper discusses and evaluates a new data reuse framework, specifically customized for embedded CMPs executing loop-intensive stencil applications.
- It distinguishes between intra-processor and inter-processor data reuse.
4
Introduction (cont’d)
This paper targets:
- Embedded CMPs
- CMPs in which different on-chip processors can share data through an on-chip L2 cache
- Optimization of stencil computations
5
Related Work
- Optimization of stencil computations by customized compilers.
- Issues concerning the CMP memory bandwidth bottleneck.
6
Stencil Computation
- A common type of computation in embedded array-based application codes.
- In each iteration of a stencil computation, an array element is updated based on the values of its neighbor elements.
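As a minimal sketch of the update rule described on this slide, the following Python function performs one sweep of a 5-point stencil. The function name `jacobi_step`, the averaging rule, and the sample grid are illustrative assumptions, not taken from the paper.

```python
def jacobi_step(grid):
    """One sweep of a 5-point stencil (illustrative, not from the paper).

    Each interior element becomes the average of its four neighbors;
    boundary elements are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so every read sees the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 4.0
    return new


# Hypothetical 3x3 input: only the right-hand neighbor of the center is non-zero.
grid = [[0.0, 0.0, 0.0],
        [0.0, 0.0, 4.0],
        [0.0, 0.0, 0.0]]
result = jacobi_step(grid)
```

Neighboring iterations read overlapping elements, which is why stencil codes offer data reuse both within one processor and across processors that own adjacent blocks.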
7
Data Sharing vs. Data Reuse
8
Some Mathematical Representation
- f : I → A, f(I) = FI + ζ, where F is an n × l matrix and ζ is an n-dimensional constant vector.
- Linear loop transformations can be used to optimize a loop nest.
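A small sketch of the representation on this slide, assuming the usual reading in which I is an iteration vector of an l-deep loop nest and f(I) is the index of the element touched in an n-dimensional array. The reference A[i-1][j+1], the loop interchange T, and all helper names are hypothetical choices made for illustration.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]


def mat_mul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def access(F, zeta, I):
    """Evaluate f(I) = F I + zeta and return the array index as a tuple."""
    return tuple(x + z for x, z in zip(mat_vec(F, I), zeta))


# Hypothetical reference A[i-1][j+1] in a 2-deep (i, j) nest: n = 2, l = 2.
F = [[1, 0], [0, 1]]
zeta = [-1, 1]
I = [3, 4]
element = access(F, zeta, I)  # index touched at iteration (i, j) = (3, 4)

# A linear loop transformation T maps I to I' = T I. Expressed in the new
# iteration vector, the same reference has access matrix F T^-1.
T = [[0, 1], [1, 0]]          # loop interchange (an assumption for this sketch)
T_inv = T                     # interchange is its own inverse
I_new = mat_vec(T, I)
F_new = mat_mul(F, T_inv)
element_after = access(F_new, zeta, I_new)
```

The transformation changes the order in which iterations run, not which element each iteration touches, so `element` and `element_after` name the same array location.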
9
The Algorithm for Solving a Set of Equations
For V and W processor pairs that share data, the complexity of this algorithm is WY
10
Two Important Lemmas
- Lemma 1: if processor P2 exhibits self-reuse after loop transformation T2, then processor P1 also exhibits self-reuse after loop transformation T1.
  This keeps the original intra-processor data self-reuse pattern.
- Lemma 2: if the last column of F has only one non-zero entry and processor P2 preserves group-reuse after loop transformation T2, then processor P1 also preserves group-reuse after loop transformation T1.
  This keeps the original intra-processor data group-reuse pattern in most cases.
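To make the term self-reuse concrete, here is a brute-force sketch that assumes its usual meaning: a single reference touches the same array element in more than one iteration. It only illustrates the property that the lemmas say is preserved; it does not implement the lemmas or the paper's algorithm, and the two sample references are hypothetical.

```python
from itertools import product


def access(F, zeta, I):
    """Evaluate f(I) = F I + zeta and return the array index as a tuple."""
    return tuple(sum(F[r][c] * I[c] for c in range(len(I))) + zeta[r]
                 for r in range(len(F)))


def has_self_reuse(F, zeta, bounds):
    """Return True if two distinct iterations in the box given by `bounds`
    touch the same element through the reference (F, zeta)."""
    seen = set()
    for I in product(*(range(b) for b in bounds)):
        element = access(F, zeta, I)
        if element in seen:
            return True
        seen.add(element)
    return False


# Hypothetical references inside a 4 x 4 (i, j) loop nest.
row_ref = ([[1, 0]], [0])                  # A[i]: same element for every j
full_ref = ([[1, 0], [0, 1]], [0, 0])      # B[i][j]: a new element each time

row_has_reuse = has_self_reuse(*row_ref, bounds=(4, 4))
full_has_reuse = has_self_reuse(*full_ref, bounds=(4, 4))
```

In this sketch `A[i]` is reused along the j loop, while `B[i][j]` is touched exactly once per iteration.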
11
Experimental Environment
- Simulation: Simics tool-set
- Private L1 cache
- Shared L2 cache
12
Stencil Applications Used in Experiments
13
Stencil Applications Used in Experiments (cont’d)
14
Savings in the Off-chip Memory References and Execution Cycles
15
Reduction in Execution Cycles with Different Processor Counts
16
Conclusion
- Minimizing the number of off-chip memory references is very important in embedded chip multiprocessors, from both a performance perspective and a power perspective.
- This paper proposes and evaluates a compiler-based solution for stencil computations: it re-organizes the loop iterations assigned to processors in a coordinated fashion so that the reuse distance to shared data is minimized.
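The idea of coordinating iteration orders can be illustrated with a toy model. This is a deliberately simplified sketch, not the paper's algorithm: two processors each sweep a block of rows with a 3-row stencil, time is counted in iterations, and the "gap" is the difference between the first time steps at which the two processors touch a shared row. All names and the row assignment are assumptions.

```python
def first_touch(order, n_rows, halo=1):
    """Map each row to the first time step at which a processor touches it.

    `order` lists the rows the processor updates, one per time step;
    updating row r reads rows r - halo .. r + halo (toy 1-D stencil).
    """
    first = {}
    for t, row in enumerate(order):
        for r in range(row - halo, row + halo + 1):
            if 0 <= r < n_rows:
                first.setdefault(r, t)
    return first


def max_shared_gap(order_a, order_b, n_rows):
    """Largest time gap between the two processors' first touches of any
    row that both of them access."""
    a = first_touch(order_a, n_rows)
    b = first_touch(order_b, n_rows)
    shared = set(a) & set(b)
    return max(abs(a[r] - b[r]) for r in shared)


N_ROWS = 8
p2_order = [4, 5, 6, 7]        # P2 starts at the boundary with P1's block
p1_ascending = [0, 1, 2, 3]    # P1 reaches the boundary last
p1_descending = [3, 2, 1, 0]   # P1 reordered to start at the boundary

gap_uncoordinated = max_shared_gap(p1_ascending, p2_order, N_ROWS)
gap_coordinated = max_shared_gap(p1_descending, p2_order, N_ROWS)
```

In this toy model the shared boundary rows are touched three time steps apart when P1 sweeps upward from row 0, and in the same time step when P1 starts at the boundary, so the shared data is more likely to still be in the shared L2 cache when the second processor needs it.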
17
Any Questions?