Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo*, and Yunheung Paek*
*Seoul National University  **UNIST (Ulsan National Institute of Science & Technology)
ARC, March 21, 2012, Hong Kong
Reconfigurable Architecture
- High performance
- Flexible (cf. ASIC)
- Energy efficient (cf. GPU)
Source: ChipDesignMag.com
Coarse-Grained Reconfigurable Architecture
Coarse-grained RA:
- Word-level granularity
- Dynamic reconfigurability
- Simpler to compile
Execution model (figure): a main processor and the CGRA share main memory through a DMA controller. Examples: MorphoSys, ADRES.
Application Mapping
Place and route the DFG in the PE array mapping space. The mapping must satisfy several constraints:
- Nodes must be mapped to PEs that have the right functionality
- Data transfer between nodes must be guaranteed
- Resource consumption should be minimized for performance
Compilation flow (figure): the front end turns the application into IR; a partitioner splits it into sequential code, which goes through conventional C compilation to assembly, and loops, which go through DFG generation and place and route (guided by architecture parameters) to produce the configuration; an extended assembler combines the executable and the configuration.
Modulo Scheduling-Based Mapping
Software pipelining (figure): the example DFG (nodes 0-4, computing C[i] from A[i] and B[i]) is placed on PEs 0-3, and a new iteration starts every II = 2 cycles, overlapping iterations in time. II: initiation interval.
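To make the overlap concrete, here is a minimal sketch (hypothetical names and a made-up schedule depth, not the authors' scheduler) of which iterations are in flight at each cycle under modulo scheduling:

```python
# Minimal model of software pipelining: iteration k starts at cycle
# k * II, so several iterations overlap once the pipeline fills.
II = 2       # initiation interval from the slide (cycles between starts)
DEPTH = 6    # cycles one iteration occupies (assumed schedule length)

def active_iterations(cycle, ii=II, depth=DEPTH):
    """Iterations that have started but not yet finished at `cycle`."""
    return [k for k in range(cycle // ii + 1) if cycle - k * ii < depth]

for c in range(7):
    print(c, active_iterations(c))
# at cycle 5, iterations 0, 1, and 2 are all active at once
```

Lowering II packs more iterations into the array at once, which is why II is the key quality metric of a modulo schedule.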
Problem: Scalability
Mapping suffers several problems on a large-scale CGRA:
- Lack of parallelism: general applications have limited ILP, and exposing more by unrolling inflates the configuration size
- Placement and routing must search a very large mapping space, so compilation time skyrockets
As a result, CGRAs remain at 4x4 or 8x8 at the most.
Overview
- Background
- SIMD Reconfigurable Architecture (SIMD RA)
- Mapping on SIMD RA
- Evaluation
SIMD Reconfigurable Architecture
- Consists of multiple identical parts, called cores
- Cores are identical so that configurations can be reused
- At least one load-store PE in each core
Figure: cores 1-4 connected to memory banks 1-4 through a crossbar switch.
Advantages of SIMD RA
- More iterations are executed in parallel, scaling with the PE array size
- Short compilation time thanks to the small mapping space
- Achieves a denser scheduled configuration: higher utilization and performance
- Restriction: the loop must not have loop-carried dependences
Figure: with one large core, iterations 0-5 execute back to back; with four cores, iterations 0-11 are spread across cores 1-4 and finish in the same time.
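A back-of-the-envelope model (an illustrative assumption, not a measurement from the talk) of why spreading independent iterations over C cores helps: each core pipelines its own share with interval II, so the last iteration finishes far earlier.

```python
# Hypothetical cycle-count model: N independent iterations split over
# `cores` cores, each core starting one of its iterations every `ii`
# cycles, each iteration taking `depth` cycles end to end.
def total_cycles(n_iter, cores, ii, depth):
    per_core = -(-n_iter // cores)          # ceil(n_iter / cores)
    return (per_core - 1) * ii + depth      # when the last one finishes

print(total_cycles(12, 1, 2, 6))  # single core: 28 cycles
print(total_cycles(12, 4, 2, 6))  # four cores:  10 cycles
```

The model ignores bank conflicts, which is exactly the issue the next slides address.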
Overview
- Background
- SIMD Reconfigurable Architecture (SIMD RA)
- Bank Conflict Minimization in SIMD RA
- Evaluation
Problems of SIMD RA Mapping
SIMD RA introduces a new mapping problem: iteration-to-core mapping. How iterations are assigned to cores affects performance because, together with the data mapping, it determines the number of bank conflicts.
Example (figure): for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i]; } — 15 iterations distributed over cores 1-4.
Mapping Schemes
Example loop: for(i=0 ; i<15 ; i++) { B[i] = A[i] + B[i]; }
Iteration-to-core mapping (figure): sequential assignment gives core 1 iterations 0-3, core 2 iterations 4-7, core 3 iterations 8-11, and core 4 iterations 12-14; interleaved assignment gives core 1 iterations 0, 4, 8, 12, core 2 iterations 1, 5, 9, 13, core 3 iterations 2, 6, 10, 14, and core 4 iterations 3, 7, 11.
Data mapping (figure): interleaved placement scatters consecutive elements across the banks (bank 1 holds A[0], A[4], A[8], A[12] and B[1], B[5], B[9], B[13], and so on), while sequential placement keeps consecutive elements together.
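The two iteration-to-core schemes from the figure can be sketched as follows (hypothetical helper names; 15 iterations on 4 cores, matching the example loop):

```python
def interleaved(n, cores):
    """Iteration i goes to core i mod cores (round-robin)."""
    return {c: [i for i in range(n) if i % cores == c] for c in range(cores)}

def sequential(n, cores):
    """Contiguous chunks of ceil(n / cores) iterations per core."""
    chunk = -(-n // cores)
    return {c: list(range(c * chunk, min((c + 1) * chunk, n)))
            for c in range(cores)}

print(interleaved(15, 4))  # core 0: [0, 4, 8, 12], ... as on the slide
print(sequential(15, 4))   # core 0: [0, 1, 2, 3], ..., core 3: [12, 13, 14]
```

Which scheme wins depends on the data mapping it is paired with, as the next two slides show.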
Interleaved Data Placement
With interleaved data placement, interleaved iteration assignment is better than sequential iteration assignment.
Weakness: strided accesses (e.g., Load A[2i] instead of Load A[i] in the configuration) reduce the number of utilized banks and increase bank conflicts.
(Figure: the same interleaved bank layout as on the previous slide.)
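Why stride hurts interleaved placement can be seen with a small sketch (assuming word-interleaved banks, i.e., element i lives in bank i mod NUM_BANKS):

```python
# With word-interleaved placement, a strided access A[s*i] only ever
# touches the banks in {s*i mod NUM_BANKS}; an even stride on four
# banks therefore uses half (or fewer) of them.
NUM_BANKS = 4

def banks_touched(stride, n=16, banks=NUM_BANKS):
    """Distinct banks hit by accesses A[stride * i] for i = 0..n-1."""
    return sorted({(stride * i) % banks for i in range(n)})

print(banks_touched(1))  # [0, 1, 2, 3] -- all banks used
print(banks_touched(2))  # [0, 2]      -- half the banks, more conflicts
print(banks_touched(4))  # [0]         -- every access piles onto one bank
```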
14
14 Sequential data placement Cannot work well with SIMD mapping Cause frequent bank conflicts Data tiling i) array base address modification ii) rearranging data on the local memory. Sequential iteration assignment with data tiling suits for SIMD mapping 14 Crossbar Switch A[0] A[1] A[2] A[3] B[0] B[1] B[2] B[3] A[4] A[5] A[6] A[7] B[4] B[5] B[6] B[7] A[8] A[9] A[10] A[11] B[8] B[9] B[10] B[11] A[12] A[13] A[14] B[12] B[13] B[14] Crossbar Switch A[0] A[1] A[2] A[3] A[4] A[5] A[13] A[14] B[0] B[1] B[2] B[3] B[4] B[5] B[13] B[14] …… Iter. 0-3 Iter. 4-7 Iter. 12-14 Iter. 8-11 Iter. 0,4,8,12 Iter. 1,5,9,13 Iter. 3,7,11 Iter. 2,6,10,14 Configuration Load A[i] … … …
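A minimal sketch of the rearranging step of data tiling (a hypothetical `tile` helper, assuming equal-size banks and one contiguous chunk per core):

```python
# Data tiling, rearranging step: split the array into contiguous
# chunks so that the chunk used by core c lands entirely in bank c.
# Sequential iteration assignment then never crosses banks.
def tile(array, banks):
    chunk = -(-len(array) // banks)   # ceil(len / banks) elements per bank
    return [array[c * chunk:(c + 1) * chunk] for c in range(banks)]

A = list(range(15))
print(tile(A, 4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14]]
```

The base-address modification mentioned on the slide would additionally redirect each core's loads to the start of its own chunk.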
Summary of Mapping Combinations
Two of the four combinations have strong advantages:
- Interleaved iteration, interleaved data mapping: simple data management, but weak for accesses with stride
- Sequential iteration, sequential data mapping (with data tiling): more robust against bank conflicts, but incurs data-rearranging overhead
Experimental Setup
- Sets of loop kernels from OpenCV, multimedia, and SPEC 2000 benchmarks
- Target system:
  - Two CGRA sizes: 4x4 and 8x4
  - 2x2 cores, each with one load-store PE and one multiplier PE
  - Mesh plus diagonal connections between PEs
  - Full crossbar switch between PEs and local memory banks
- Compared with non-SIMD mapping:
  - Original: previous non-SIMD mapping
  - SIMD: our approach (interleaved iteration, interleaved data mapping)
Configuration Size
Configuration size is reduced by 61% on the 4x4 CGRA and by 79% on the 8x4 CGRA (chart).
Runtime
Runtime improves by 29% and 32% on the two CGRA sizes (chart).
Conclusion
Presented the SIMD reconfigurable architecture, which exploits data parallelism and instruction-level parallelism at the same time. Its advantages:
- Scales well to a large number of PEs
- Alleviates the growth in compilation time
- Increases performance and reduces configuration size
Thank you!
Core Size
For a large loop, a small core might not be a good match. Multiple cores can be merged into a macrocore; no hardware modification is required.
Figure: cores 1 and 2 form macrocore 1, and cores 3 and 4 form macrocore 2, still sharing banks 1-4 through the crossbar switch.
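One way the macrocore size could be chosen is sketched below (a hypothetical heuristic, not taken from the slides; it merges cores in powers of two until the loop body fits):

```python
# Hypothetical core-merging heuristic: grow the macrocore until it has
# enough PEs for the DFG, never exceeding the number of physical cores.
# This is purely a compiler-side grouping -- no hardware change.
def macrocore_size(dfg_nodes, pes_per_core, num_cores):
    size = 1
    while size < num_cores and dfg_nodes > size * pes_per_core:
        size *= 2                     # merge neighbouring cores pairwise
    return size

print(macrocore_size(6, 4, 4))   # 6 ops > 4 PEs: merge two cores -> 2
print(macrocore_size(3, 4, 4))   # fits in a single core -> 1
```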
SIMD RA Mapping Flow
Flow (figure): check the SIMD requirement; if it is not met, fall back to traditional mapping. Otherwise, select a core size and perform iteration mapping, either interleaved-interleaved (with implicit array placement) or sequential with data tiling, then do operation mapping by modulo scheduling. If scheduling fails, increase II and repeat; if scheduling fails and II exceeds MaxII, increase the core size.
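The flow's outer retry loops might be sketched like this (a hypothetical driver; `try_modulo_schedule` stands in for the mapper's actual scheduling step):

```python
# Outer loops of the mapping flow: for each candidate core size, try
# modulo scheduling with increasing II up to MaxII; if every II fails,
# grow the core size. Returning None means falling back to the
# traditional (non-SIMD) mapping.
def map_loop(try_modulo_schedule, core_sizes, min_ii, max_ii):
    for core_size in core_sizes:               # e.g. 1, 2, 4 cores merged
        for ii in range(min_ii, max_ii + 1):   # increase II on failure
            if try_modulo_schedule(core_size, ii):
                return core_size, ii           # SIMD mapping found
    return None                                # fall back to traditional

# toy predicate: succeeds once core_size * ii provides enough slots
print(map_loop(lambda cs, ii: cs * ii >= 6, [1, 2, 4], 2, 4))  # (2, 3)
```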