Memory-Aware Application Mapping on Coarse-Grained Reconfigurable Arrays
5th International Conference on High Performance Embedded Architectures and Compilers (HiPEAC 2010)
Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon and Yunheung Paek
Software Optimization And Restructuring (SO&R), Department of Electrical Engineering, Seoul National University, Seoul, Korea
* Embedded Systems Research Lab, ECE, Ulsan Nat'l Institute of Science & Tech, Ulsan, Korea
** Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA
Coarse-Grained Reconfigurable Array (CGRA) — SO&R and CML Research Group
- High computation throughput
- Low power consumption and scalability
- High flexibility with fast configuration
[Table: power efficiency (MIPS, mW, MIPS/mW) of an embedded processor (XScale) vs. TI TMS320C6x-series DSPs (VLIW); CGRAs show 10~100 MIPS/mW]
Coarse-Grained Reconfigurable Array (CGRA)
- Array of PEs (processing elements) with a mesh-like interconnect network
- Each PE can operate on the results of its neighbor PEs
- Executes computation-intensive kernels
Application mapping in CGRA
- Mapping a DFG (dataflow graph) onto the PE-array mapping space
- Must satisfy several conditions:
  - Nodes must be mapped to PEs with the right functionality
  - Data transfer between nodes must be guaranteed by the interconnect
  - Resource consumption should be minimized for performance
CGRA execution & data mapping
t_c: computation time, t_d: data transfer time
[Figure: PE array with configuration memory; local memory with four banks (Bk1–Bk4), each split into two buffers (buf1/buf2); DMA between local memory and main memory]
- Double buffering: the DMA fills one buffer while the PE array computes on the other
- Total runtime = max(t_c, t_d)
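The double-buffering model above can be sketched as a one-line cost function; this is an illustrative model of the slide's formula, and the function name is ours, not from the talk.

```python
def tile_runtime(t_c, t_d):
    """Steady-state cost of one tile under double buffering: while the
    PE array computes on one buffer, the DMA fills the other with the
    next tile, so computation and data transfer overlap and each tile
    costs max(t_c, t_d) rather than t_c + t_d."""
    return max(t_c, t_d)
```

For a memory-bound tile (t_d > t_c) the transfer time dominates, which is exactly the bottleneck the rest of the talk attacks.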
The performance bottleneck: data transfer
- Many multimedia kernels have a larger t_d than t_c
- On average, t_c accounts for just 22% of the total time (100% = t_c + t_d)
- Most applications are memory-bound
Computation mapping & data mapping
- Duplicating arrays increases data transfer time
[Example: LD S[i] and LD S[i+1] feeding an add are mapped to PEs connected to different banks, forcing array S to be duplicated in both banks of the local memory]
Contributions of this work
- First approach to consider computation mapping and data mapping together
  - balances t_c and t_d
  - minimizes duplicate arrays (maximizes data reuse)
  - balances bank utilization
- Simple yet effective extension
  - a set of cost functions
  - can be plugged into existing compilation frameworks, e.g., EMS (edge-centric modulo scheduling)
Application mapping flow
DFG → Performance Bottleneck Analysis (produces the DCR) and Data Reuse Analysis (produces the DRG) → Memory-aware Modulo Scheduling → Mapping
Preprocessing 1: performance bottleneck analysis
- Determines whether it is computation or data transfer that limits overall performance
- Calculates the DCR (data-transfer-to-computation time ratio): DCR = t_d / t_c
- DCR > 1: the loop is memory-bound
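The bottleneck test is a direct ratio check; a minimal sketch (function names are ours):

```python
def dcr(t_d, t_c):
    """Data-transfer-to-computation time ratio: DCR = t_d / t_c."""
    return t_d / t_c

def is_memory_bound(t_d, t_c):
    """DCR > 1 means data transfer, not computation, limits the loop."""
    return dcr(t_d, t_c) > 1.0
```

With the average split reported earlier (t_c = 22% of total), a typical kernel has DCR = 78/22 ≈ 3.5 and is firmly memory-bound.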
Preprocessing 2: data reuse analysis
- Finds the amount of potential data reuse
- Creates a DRG (Data Reuse Graph): nodes correspond to memory operations, and edge weights approximate the amount of reuse
- An edge weight is estimated as TS − rd, where TS is the tile size and rd is the reuse distance in iterations
[Example DRG nodes: S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], R2[i]]
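The DRG construction above can be sketched as follows; this is our reading of the slide, with the reference encoding (array name, constant offset) and all function names being our assumptions.

```python
def reuse_weight(tile_size, reuse_distance):
    """Edge weight TS - rd, clamped at 0: a reuse distance longer than
    the tile means the reused element has left the tile already."""
    return max(tile_size - reuse_distance, 0)

def build_drg(refs, tile_size):
    """refs: list of (array_name, offset) pairs, e.g. ('S', 1) for S[i+1].
    Returns {(i, j): weight} edges between references to the same array,
    where the reuse distance is the offset difference in iterations."""
    edges = {}
    for i, (a1, o1) in enumerate(refs):
        for j, (a2, o2) in enumerate(refs):
            if j <= i or a1 != a2:
                continue  # one edge per pair, same-array refs only
            w = reuse_weight(tile_size, abs(o1 - o2))
            if w > 0:
                edges[(i, j)] = w
    return edges
```

For example, with a tile size of 8, S[i] and S[i+1] get an edge of weight 7, while S[i] and S[i+5] get weight 3: nearer reuses are worth more.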
Application mapping flow (cont.)
The DCR and DRG are used for cost calculation during memory-aware modulo scheduling.
Mapping with data reuse opportunity cost (DROC)
[Example: arrays A[i] and A[i+1] reside in Bank1 and B[i] in Bank2; placing a load on a PE row connected to the wrong bank forfeits the reuse]
- Memory-unaware cost alone can pick a placement that loses reuse
- New total cost = memory-unaware cost + data reuse opportunity cost (DROC)
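A minimal sketch of the DROC term, under our reading of the slide: if placing a memory operation separates it from a DRG neighbor already committed to another bank, the reuse on that edge is lost (the array would have to be duplicated), so the edge weight is charged as a penalty. All names here are ours.

```python
def droc(op, bank, drg_edges, bank_of):
    """Data reuse opportunity cost of placing memory op 'op' on 'bank'.
    drg_edges: {(u, v): weight} from the DRG.
    bank_of:   {op: bank} for already-placed memory operations."""
    cost = 0
    for (u, v), w in drg_edges.items():
        if op not in (u, v):
            continue
        other = v if op == u else u
        if other in bank_of and bank_of[other] != bank:
            cost += w  # reuse forfeited: array must be duplicated
    return cost
```

Adding this penalty to the memory-unaware cost steers the scheduler toward placements that keep reuse partners on the same bank.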
BBC (Bank Balancing Cost)
- Prevents allocating all data to just one bank
- BBC(b) = β × A(b)
  - β: the base balancing cost (a design parameter)
  - A(b): the number of arrays already mapped onto bank b
[Example with β = 10: Bank1 already holds A[i] and A[i+1], so mapping candidate B[i] to Bank1 incurs a higher balancing cost than mapping it to the emptier Bank2]
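The balancing term is a direct transcription of the slide's formula; a minimal sketch (the dictionary encoding of bank occupancy is our assumption):

```python
def bbc(bank, arrays_on_bank, beta=10):
    """Bank balancing cost BBC(b) = beta * A(b), where A(b) is the
    number of arrays already mapped onto bank b and beta is a design
    parameter (the base balancing cost)."""
    return beta * arrays_on_bank.get(bank, 0)
```

With β = 10 and two arrays already on Bank1, BBC(Bank1) = 20 while BBC(Bank2) = 0, so, other costs being equal, the next array goes to Bank2.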
Application mapping flow (cont.)
After memory-aware modulo scheduling, a Partial Shutdown Exploration step is added before the final mapping.
Partial Shutdown Exploration
- For a memory-bound loop, performance is often limited by the memory bandwidth rather than by computation, so computation resources are in surplus
- Explore partial shutdown of PE rows and memory banks to find the configuration with the minimum EDP (energy-delay product)
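The exploration itself is a small exhaustive search over row/bank counts; a sketch of that loop, where the per-configuration runtime/energy model `evaluate` is caller-supplied (an assumption of this sketch, since the talk obtains these numbers by rescheduling the loop on each reduced architecture):

```python
from itertools import product

def partial_shutdown_explore(rows, banks, evaluate):
    """Enumerate configurations with r active PE rows and m active
    memory banks (1 <= r <= rows, 1 <= m <= banks) and return the
    tuple (edp, r, m) minimizing the energy-delay product.
    evaluate(r, m) must return (runtime, energy) for that config."""
    best = None
    for r, m in product(range(1, rows + 1), range(1, banks + 1)):
        runtime, energy = evaluate(r, m)
        edp = runtime * energy
        if best is None or edp < best[0]:
            best = (edp, r, m)
    return best
```

For a 4×4 CGRA with 4 banks this is only 16 candidate configurations, so the exhaustive search is cheap.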
Example of partial shutdown exploration
[Table: candidate configurations (active PE rows × active memory banks) for a kernel containing LD S[i], LD S[i+1], LD D[i], ST R[i], compared by t_c, t_d, runtime (R), energy (E), and the product R×E]
Experimental Setup
- A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks
- Target architecture
  - 4×4 heterogeneous CGRA (4 memory-accessible PEs)
  - 4 memory banks, each connected to one row
  - Each PE connected to its four neighbors and four diagonal neighbors
- Compared mapping flows
  - Ideal: memory-unaware mapping + single-bank memory architecture
  - MU: memory-unaware mapping (EMS*) + multi-bank memory architecture
  - MA: memory-aware mapping + multi-bank memory architecture
  - MA + PSE: MA + partial shutdown exploration
* Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT '08
Runtime comparison
Compared with MU, MA reduces runtime by 30%.
Energy consumption comparison
MA + PSE shows a 47% reduction in energy consumption.
Conclusion
- CGRAs provide very high power efficiency while remaining software-programmable
- While previous solutions focused on computation speed, we also consider data transfer to achieve higher performance
- We proposed an effective heuristic that takes the memory architecture into account
- It achieves a 62% reduction in energy-delay product, which factors into 47% and 28% reductions in energy consumption and runtime, respectively (0.53 × 0.72 ≈ 0.38)
Thank you for your attention!