1 5th International Conference, HiPEAC 2010. MEMORY-AWARE APPLICATION MAPPING ON COARSE-GRAINED RECONFIGURABLE ARRAYS. Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon, and Yunheung Paek. Software Optimization And Restructuring (SO&R) Research Group, Department of Electrical Engineering, Seoul National University, Seoul, South Korea; * Embedded Systems Research Lab, ECE, Ulsan Nat’l Institute of Science & Tech, Ulsan, Korea; ** Compiler and Microarchitecture Lab (CML), Center for Embedded Systems, Arizona State University, Tempe, AZ, USA. 2010-01-25

2 Coarse-Grained Reconfigurable Array (CGRA)
 High computation throughput
 Low power consumption and scalability
 High flexibility with fast configuration

Category     Processor         MIPS   mW     MIPS/mW
Embedded     Xscale            1250   1600   0.78
DSP          TI TMS320C6455    9570   3300   2.9
DSP (VLIW)   TI TMS320C614T    4711   670    7

* A CGRA achieves roughly 10~100 MIPS/mW

3 Coarse-Grained Reconfigurable Array (CGRA)
 Array of PEs
 Mesh-like interconnect network
 Each PE operates on the results of its neighbor PEs
 Executes computation-intensive kernels

4 Application mapping in CGRA
 Mapping a DFG onto the PE-array mapping space
 Several conditions must be satisfied:
  - nodes must be mapped onto PEs that have the right functionality
  - data transfer between dependent nodes must be guaranteed
  - resource consumption should be minimized for performance

5 CGRA execution & data mapping
 t_c: computation time, t_d: data transfer time
 [Figure: PE array with configuration memory; the main memory is connected through DMA to a local memory of four banks (Bk1..Bk4), each holding two buffers (buf1, buf2) for double buffering.]
 With double buffering, total runtime = max(t_c, t_d)
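A minimal sketch (Python, made-up cycle counts) of why double buffering makes the per-tile runtime max(t_c, t_d): while the PE array computes on one buffer, the DMA fills the other buffer with the next tile's data.

    def runtime_double_buffered(t_c, t_d, n_tiles):
        # Prologue: DMA brings in the data for the first tile.
        total = t_d
        # Steady state: compute tile i while the DMA fetches tile i+1,
        # so each of these steps costs max(t_c, t_d).
        total += (n_tiles - 1) * max(t_c, t_d)
        # Epilogue: compute the last tile; no transfer is left to overlap with.
        total += t_c
        return total

    # e.g. t_c = 100, t_d = 280 cycles per tile, 10 tiles:
    # 280 + 9 * 280 + 100 = 2900 cycles, roughly n_tiles * max(t_c, t_d) for large n_tiles.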

6 The performance bottleneck: data transfer
 Many multimedia kernels show a larger t_d than t_c
 On average, t_c is only 22% of the total (100% = t_c + t_d)
 Most applications are memory-bound.

7 Computation Mapping & Data Mapping
 Duplicating an array across banks increases the data transfer time.
 [Figure: a DFG with LD S[i] and LD S[i+1] feeding an add; when the two loads are served from different local-memory banks, the array S must be duplicated in both banks.]

8 Contributions of this work
 First approach to consider computation mapping and data mapping together
  - balances t_c and t_d
  - minimizes duplicate arrays (maximizes data reuse)
  - balances bank utilization
 Simple yet effective extension
  - a set of cost functions
  - can be plugged into existing compilation frameworks, e.g., EMS (edge-centric modulo scheduling)

9 Application mapping flow
 [Flow: DFG → Performance Bottleneck Analysis (produces the DCR) and Data Reuse Analysis (produces the DRG) → Memory-aware Modulo Scheduling → Mapping]

10 Preprocessing 1: Performance bottleneck analysis
 Determines whether it is computation or data transfer that limits the overall performance
 Calculate the DCR (data-transfer-to-computation time ratio): DCR = t_d / t_c
 DCR > 1: the loop is memory-bound
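A small sketch of the DCR check (hypothetical helper and parameter names), assuming t_d is estimated from the bytes each tile must transfer over the available memory bandwidth and t_c from the schedule length:

    def dcr(bytes_per_tile, bytes_per_cycle, ii, iterations_per_tile):
        t_d = bytes_per_tile / bytes_per_cycle   # cycles the DMA needs per tile
        t_c = ii * iterations_per_tile           # cycles the PE array computes per tile
        return t_d / t_c

    # Example (made-up numbers): 4 KB per tile at 4 bytes/cycle, II = 4, 64 iterations per tile
    memory_bound = dcr(4096, 4, 4, 64) > 1.0     # 1024 / 256 = 4.0, so the loop is memory-bound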

11 Preprocessing 2: Data reuse analysis
 Finds the amount of potential data reuse
 Creates a DRG (Data Reuse Graph)
  - nodes correspond to memory operations; edge weights approximate the amount of reuse
  - an edge weight is estimated as TS - rd, where TS is the tile size and rd is the reuse distance in iterations
 [Figure: example DRG with nodes S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], R2[i]]
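A minimal sketch (hypothetical representation of memory operations) of building a DRG with the TS - rd weight above; accesses to the same array whose indices differ by a constant d reuse each other after d iterations:

    from itertools import combinations

    def build_drg(mem_ops, tile_size):
        """mem_ops: list of (op_name, array_name, index_offset) for references like S[i+offset]."""
        edges = {}
        for (n1, a1, o1), (n2, a2, o2) in combinations(mem_ops, 2):
            if a1 != a2:
                continue                          # reuse only happens within the same array
            reuse_distance = abs(o1 - o2)         # iterations until the same element is touched again
            weight = tile_size - reuse_distance   # TS - rd, the reuse estimate from the slide
            if weight > 0:
                edges[(n1, n2)] = weight
        return edges

    # Example: LD S[i], LD S[i+1], LD D[i] with a tile size of 64
    drg = build_drg([("ld_S0", "S", 0), ("ld_S1", "S", 1), ("ld_D", "D", 0)], tile_size=64)
    # -> {("ld_S0", "ld_S1"): 63}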

12 Application mapping flow
 [Flow: DFG → Performance Bottleneck Analysis (DCR) and Data Reuse Analysis (DRG) → Memory-aware Modulo Scheduling → Mapping]
 The DCR and DRG are used for cost calculation

13 Mapping with data reuse opportunity cost (DROC)
 [Figure: a DFG mapped onto a 4x4 PE array with a two-bank local memory holding A[i], A[i+1], B[i], and B[i+1]; cost tables compare, per candidate PE, the memory-unaware cost, the data reuse opportunity cost, and the new total cost.]
 New total cost = memory-unaware cost + DROC
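A rough sketch (hypothetical data structures, not the authors' exact formulation) of how a DROC term can be folded into an EMS-style placement cost: if a memory operation has DRG neighbors already placed on a different bank, the reuse weight on those edges is lost and is added as a penalty for the candidate bank.

    def placement_cost(op, candidate_bank, placed_banks, drg_edges, base_cost):
        """placed_banks: memory op -> bank, for operations that are already scheduled."""
        droc = 0
        for (u, v), weight in drg_edges.items():
            other = v if u == op else (u if v == op else None)
            if other is None or other not in placed_banks:
                continue
            if placed_banks[other] != candidate_bank:
                droc += weight      # reuse with `other` is lost if we pick this bank
        return base_cost + droc

    # Example: placing LD A[i+1] on bank2 while LD A[i] already sits on bank1
    cost = placement_cost("ld_A1", "bank2", {"ld_A0": "bank1"},
                          {("ld_A0", "ld_A1"): 63}, base_cost=40)   # -> 40 + 63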

14 BBC (Bank Balancing Cost)
 Prevents allocating all data to just one bank
 BBC(b) = β × A(b)
  - β: the base balancing cost (a design parameter)
  - A(b): the number of arrays already mapped onto bank b
 [Figure: placement example with β = 10, showing how a candidate PE's cost grows with the number of arrays already mapped onto its bank.]
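A one-line sketch of the balancing term as defined above (the per-bank array counts are assumed to be tracked by the scheduler):

    def bank_balancing_cost(bank, arrays_per_bank, beta=10):
        # beta: base balancing cost, a design parameter (10 in the slide's example)
        return beta * arrays_per_bank.get(bank, 0)

    # With {"bank1": 2, "bank2": 0}: the next array costs 20 on bank1 and 0 on bank2.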

15 Application mapping flow
 [Flow: DFG → Performance Bottleneck Analysis (DCR) and Data Reuse Analysis (DRG) → Memory-aware Modulo Scheduling → Mapping, now with a Partial Shutdown Exploration step added]

16 Partial Shutdown Exploration
 For a memory-bound loop, the performance is often limited by the memory bandwidth rather than by computation, leaving surplus computation resources.
 Partial shutdown exploration
  - explores shutting down PE rows and memory banks
  - finds the configuration that gives the minimum EDP (energy-delay product)
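A simplified sketch of the exploration loop (the schedule, runtime, and energy helpers are assumptions standing in for the memory-aware modulo scheduler and a power model): for each number of active PE rows and memory banks, re-schedule the loop, estimate runtime and energy, and keep the configuration with the lowest energy-delay product.

    def partial_shutdown_exploration(loop, schedule, runtime, energy,
                                     max_rows=4, max_banks=4):
        best = None
        for rows in range(1, max_rows + 1):
            for banks in range(1, max_banks + 1):
                mapping = schedule(loop, rows, banks)   # memory-aware modulo scheduling
                if mapping is None:                     # the loop does not fit this configuration
                    continue
                edp = runtime(mapping) * energy(mapping, rows, banks)
                if best is None or edp < best[0]:
                    best = (edp, rows, banks)
        return best                                     # (minimum EDP, active rows, active banks)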

17 Example of Partial Shutdown Exploration
 [Figure: the example DFG (loads of S[i], S[i+1], D[i] and a store to R[i]) mapped under two configurations. With 4 rows and 2 banks, Tc = 180 and Td = 288; with 2 rows and 2 banks, Tc = 270 and Td = 288. The runtime is 288 in both cases because the loop is memory-bound, so the 2-row configuration, which consumes less energy, gives the lower runtime-energy product.]

18 Experimental Setup
 A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks
 Target architecture
  - 4x4 heterogeneous CGRA (4 memory-accessible PEs)
  - 4 memory banks, each connected to one row
  - each PE is connected to its four neighbors and four diagonal ones
 Compared mapping flows
  - Ideal: memory-unaware mapping + single-bank memory architecture
  - MU: memory-unaware mapping (EMS*) + multi-bank memory architecture
  - MA: memory-aware mapping + multi-bank memory architecture
  - MA+PSE: MA + partial shutdown exploration
 * Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT ’08

19 Runtime comparison
 Compared with MU, MA reduces the runtime by 30%.

20 Energy consumption comparison
 MA+PSE reduces energy consumption by 47%.

21 Conclusion
 CGRAs provide very high power efficiency while remaining software-programmable.
 While previous solutions have focused on computation speed, we also consider the data transfer to achieve higher performance.
 We proposed an effective mapping heuristic that takes the memory architecture into account.
 It achieves a 62% reduction in the energy-delay product, which factors into a 47% reduction in energy consumption and a 28% reduction in runtime.

22 Thank you for your attention!

