Temperature-Sensitive Loop Parallelization for Chip Multiprocessors
Sri HK Narayanan, Guilin Chen, Mahmut Kandemir, Yuan Xie
Embedded Mobile Computing Center (EMC²), The Pennsylvania State University
International Conference on Computer Design, October 2-5, 2005, San Jose
Outline
- Motivation
- Related Work
- Our Approach
- Example
- Experimental Results & Conclusion
Motivation
- Thermal hotspots are a cause for concern
  - Caused by increasing power density
  - Can result in permanent chip damage
- How to avoid damage: cooling techniques
- How to prevent hotspots: hardware techniques
- This paper proposes a compiler-directed technique to avoid hotspots in CMPs
Related work: Dynamic Thermal Management
- When one unit overheats, migrate its functionality to a distant, spare unit
  - Dual pipeline (Intel, ISQED '02)
  - Spare register file (Skadron et al., 2003)
  - Separate core (CMP) (Heo et al., ISLPED 2003)
  - Microarchitectural clusters (Intel, ICCD 2004)
- Raises many interesting issues
  - Cost-benefit tradeoff for extra area
  - Use both resources (scheduling)
  - Run-time thermal sensing/estimation
- Yesterday's UC Riverside paper (Session 2.2) proposes a run-time thermal tracking method
Related work: Design-time techniques
- MDL @ PSU:
  - Thermal-Aware IP Virtualization and Placement for Networks-on-Chip Architecture (ICCD 2004)
  - Thermal-Aware Allocation and Scheduling for MPSOC Design (DATE 2005)
  - Thermal-Aware Floorplanning Using Genetic Algorithms (ISQED 2005)
  - Thermal-aware voltage-island architecting (the other paper in this session)
- Other groups:
  - Thermal-Aware High-Level Synthesis (Northwestern Univ.: Memik, R. Dick; ISLPED 2005, ASP-DAC 2006)
  - Many more in this conference
- Industry: Gradient Design Automation (a start-up that showcased at DAC 2005)
CMP
- Justin R. Rattner, Intel director of the Corporate Technology Group, Spring 2005 IDF: "Intel researchers and scientists are experimenting with 'many tens of cores, potentially even hundreds of cores per die, per single processor die...'"
- Last night: panel discussion on CMP
- Industry examples
This paper: a compiler approach
- Temperature- and performance-sensitive loop scheduling
  - Schedules different loop iterations on CMP cores
  - Data-locality aware, and hence performance aware
- Intuition behind the approach: let "hot" cores idle while cool cores work
- Static scheduling of parallelized loop iterations at compile time
How can the compiler schedule temperature-aware code?
- This work targets loop-intensive programs run on embedded CMPs
- Loop nests are divided into chunks; the number of cycles in a chunk is Δ
- Let the starting temperature of a processor be T_c; the temperature after executing the chunk is T_c' = F(T_c, Δ, floorplan, power)
- Δ and power are obtained by profiling the code; the floorplan and physical parameters remain constant
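As a rough illustration of the form F can take, a single lumped-RC model gives an exponential approach to a power-dependent steady state. This is only a sketch: the paper's F is driven by HotSpot's compact model over the whole floorplan, and the function name and the r_th, c_th, t_amb, and freq parameters below are illustrative assumptions, not values from the paper.

    #include <math.h>

    /* Hypothetical lumped-RC form of F: exponential approach to a
     * power-dependent steady state. The paper's actual F uses HotSpot's
     * compact model over the full floorplan; r_th, c_th, t_amb, and freq
     * are assumed, illustrative parameters. */
    double next_temperature(double t_c,    /* starting temperature T_c       */
                            double delta,  /* cycles in the chunk (profiled) */
                            double power,  /* average chunk power (profiled) */
                            double r_th,   /* thermal resistance (assumed)   */
                            double c_th,   /* thermal capacitance (assumed)  */
                            double t_amb,  /* ambient temperature            */
                            double freq)   /* clock frequency in Hz          */
    {
        double t_ss = t_amb + power * r_th;  /* steady-state temperature  */
        double dt   = delta / freq;          /* chunk duration in seconds */
        /* Exponential approach to steady state with time constant R*C. */
        return t_ss + (t_c - t_ss) * exp(-dt / (r_th * c_th));
    }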
Thermal modeling
- Want a good model of chip temperature
  - That accounts for adjacency and package
  - That does not require detailed designs
  - That is fast enough for practical use
- A compact model based on thermal R, C (HotSpot)
- Parameterized to automatically derive a model based on various
  - Architectures
  - Power models
  - Floorplans
  - Thermal packages
Temperature estimation
- The temperature of each block depends on the power consumption and the location of the blocks.
- The thermal resistance R_ij of PE_i with respect to PE_j is the temperature rise at PE_i due to one unit of power dissipated at PE_j.
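Under this definition, the steady-state temperature rise at each PE follows by superposition over all power sources. A minimal sketch, assuming the resistance matrix R and the per-PE power values are available as inputs (e.g., from the thermal model and from profiling); the names here are illustrative:

    #define NUM_PE 8

    /* Steady-state temperature rise at each PE by superposition:
     * rise_i = sum_j R[i][j] * power[j], with R[i][j] in degrees per watt.
     * R and power are assumed inputs (thermal model / profiling). */
    void estimate_temperature_rise(const double R[NUM_PE][NUM_PE],
                                   const double power[NUM_PE],
                                   double rise[NUM_PE])
    {
        for (int i = 0; i < NUM_PE; i++) {
            rise[i] = 0.0;
            for (int j = 0; j < NUM_PE; j++)
                rise[i] += R[i][j] * power[j];  /* contribution of PE j */
        }
    }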
Running example: Jacobi's algorithm

    for (i = 1; i <= 600; i++)
        for (j = 1; j <= 1000; j++)
            B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;

Parallelized for 5 cores (core k executes):

    for (i = k*120 + 1; i <= (k+1)*120; i++)
        for (j = 1; j <= 1000; j++)
            B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;

Basic schedule (entries are iteration-chunk numbers; columns are cores, rows are time slots):

    Time  P0  P1  P2  P3  P4  P5  P6  P7
      1    0   6  12  18  24   -   -   -
      2    1   7  13  19  25   -   -   -
      3    2   8  14  20  26   -   -   -
      4    3   9  15  21  27   -   -   -
      5    4  10  16  22  28   -   -   -
      6    5  11  17  23  29   -   -   -
Analysis of basic schedule
Assumptions in the example:
1. Initial temperature is 0.
2. Threshold temperature is 2.
3. An idle slot reduces the temperature by 1 degree (but not below 0).
4. So at most 2 active slots can be scheduled back-to-back on one core.
5. The ideal number of active processors at any time is 5.
6. Due to Jacobi's algorithm, consecutive iteration chunks exhibit locality.

Analysis:
- Great locality
- Uses only 5 processors
- Will definitely overheat
Pure temperature-aware scheduling
Algorithm (see the sketch after this list):
- Start with time slot 0 and all iterations unscheduled.
- While unscheduled iterations exist:
  - Select the coolest A processors whose temperature is less than the threshold.
  - Schedule chunks on those processors at the current time slot.
  - Reduce the number of chunks left to schedule.
  - Increase the time slot by 1.

Analysis:
- Poor locality
- 1 extra time slot is used
- No temperature problems
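A minimal C sketch of this greedy loop, using the toy thermal model of the running example (an active slot heats a core by 1 degree, an idle slot cools it by 1 but not below 0, threshold 2). The real scheduler is driven by HotSpot temperature estimates; all names and constants here are illustrative:

    #define NUM_CORES  8
    #define NUM_CHUNKS 30
    #define ACTIVE_A   5   /* ideal number of active cores per slot */
    #define THRESHOLD  2   /* toy threshold temperature             */
    #define MAX_SLOTS  16

    /* schedule[t][c] holds the chunk run on core c in slot t, or -1 if
     * the core idles. Returns the number of slots used. */
    int schedule_temperature_aware(int schedule[MAX_SLOTS][NUM_CORES])
    {
        int temp[NUM_CORES] = {0};   /* toy temperatures, all start at 0 */
        int next_chunk = 0, t = 0;

        while (next_chunk < NUM_CHUNKS && t < MAX_SLOTS) {
            for (int c = 0; c < NUM_CORES; c++)
                schedule[t][c] = -1;
            /* Greedily pick up to A coolest cores below the threshold. */
            for (int k = 0; k < ACTIVE_A && next_chunk < NUM_CHUNKS; k++) {
                int best = -1;
                for (int c = 0; c < NUM_CORES; c++)
                    if (schedule[t][c] < 0 && temp[c] < THRESHOLD &&
                        (best < 0 || temp[c] < temp[best]))
                        best = c;
                if (best < 0)
                    break;                    /* no core is cool enough */
                schedule[t][best] = next_chunk++;
            }
            /* Toy thermal update: active heats by 1, idle cools by 1
             * (floored at 0). */
            for (int c = 0; c < NUM_CORES; c++)
                temp[c] = (schedule[t][c] >= 0)
                              ? temp[c] + 1
                              : (temp[c] > 0 ? temp[c] - 1 : 0);
            t++;
        }
        return t;
    }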
Pure temperature-aware scheduling vs. the original schedule
Pure locality-aware scheduling
Algorithm (see the sketch after this list):
- Start with a clean slate.
- For each iteration chunk:
  - Schedule it on the processor with the greatest locality with it, keeping at most two chunks together.
  - If more slots are required (when all processors are exhausted), increase the scheduling length; otherwise move to the next processor.

Analysis:
- Very good locality
- However, 2 extra time slots are used
- No temperature problems
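A minimal C sketch of this pass under the example's locality model (chunk i shares data only with chunks i-1 and i+1, and at most two chunks may run back-to-back on a core). The names are illustrative, and this is a loose sketch of the idea, not the paper's implementation:

    #define NUM_CORES  8
    #define NUM_CHUNKS 30

    /* owner[i] = core that executes chunk i. Since chunk i shares data
     * only with its neighbors, the best the pass can do is keep pairs
     * of consecutive chunks on one core -- but no more than two in a
     * row, so that core then gets an idle (cooling) slot. */
    void schedule_locality_aware(int owner[NUM_CHUNKS])
    {
        int core = 0;
        for (int chunk = 0; chunk < NUM_CHUNKS; chunk++) {
            owner[chunk] = core;
            /* After two consecutive chunks, move to the next core,
             * wrapping around (and thereby lengthening the schedule)
             * once all cores have been used in this round. */
            if (chunk % 2 == 1)
                core = (core + 1) % NUM_CORES;
        }
    }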
Locality- and temperature-aware scheduling
Algorithm (a C sketch follows this slide):
- Use temperature-aware scheduling to obtain the schedulable slots.
- Use locality-aware scheduling to assign chunks to these slots.

Schedulable slots per time step from the temperature-aware pass (■ = thermally safe slot), with C = { I0, I1, I2, I3, I4 }:

    Time 1: ■ ■ ■ ■ ■
    Time 2: ■ ■ ■ ■ ■
    Time 3: ■ ■ ■ ■ ■
    Time 4: ■ ■ ■ ■ ■
    Time 5: ■ ■ ■ ■
    Time 6: ■ ■ ■ ■ ■
    Time 7: ■

Chunks assigned to those slots by the locality pass, after which C = { }:

    Time 1: 0, 4, 8, 12, 16
    Time 2: 1, 5, 20, 24, 27
    Time 3: 9, 13, 17, 21, 25
    Time 4: 2, 6, 10, 14, 28
    Time 5: 18, 22, 26, 29
    Time 6: 3, 7, 11, 15, 19
    Time 7: 23

Analysis (best of both worlds):
- Great locality
- No temperature problems
- Good performance
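A minimal C sketch of how the two passes compose, assuming the temperature-aware pass has already produced a map of thermally safe (slot, core) positions (e.g., via schedule_temperature_aware above); illustrative only:

    #define NUM_CORES  8
    #define NUM_CHUNKS 30

    /* Combined pass: safe[t][c] != 0 marks the (slot, core) positions the
     * temperature-aware pass found thermally safe; the locality pass then
     * fills those positions with chunks in order, walking core by core so
     * that a core's consecutive safe slots receive consecutive,
     * data-sharing chunks. */
    void schedule_combined(int slots, const int safe[][NUM_CORES],
                           int schedule[][NUM_CORES])
    {
        int next_chunk = 0;
        for (int t = 0; t < slots; t++)        /* mark everything idle */
            for (int c = 0; c < NUM_CORES; c++)
                schedule[t][c] = -1;
        for (int c = 0; c < NUM_CORES; c++)
            for (int t = 0; t < slots && next_chunk < NUM_CHUNKS; t++)
                if (safe[t][c])
                    schedule[t][c] = next_chunk++;
    }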
Phase 1 - Profiling
The code is profiled to obtain cycle times, chunk sizes, energy consumption, and architecture details. Example input code:

    #include <stdio.h>

    #define N    5000
    #define ITER 1

    int du1[N], du2[N], du3[N];
    int au1[N][N][2], au2[N][N][2], au3[N][N][2];
    int a11 = 1, a12 = -1, a13 = -1;
    int a21 = 2, a22 = 3,  a23 = -3;
    int a31 = 5, a32 = -5, a33 = -2;
    int l;
    int sig = 1;

    int main()
    {
        int kx, ky, kz;
        printf("Thread:%d\n", mp_numthreads());  /* platform-specific call */
        /* Initialization loop */
        for (kx = 0; kx < N; kx = kx + 1)
            for (ky = 0; ky < N; ky = ky + 1)
                for (kz = 0; kz <= 1; kz = kz + 1) {
                    au1[kx][ky][kz] = 1;
                    au2[kx][ky][kz] = 1;
                    au3[kx][ky][kz] = 1;
                }
    } /* main */

Phase 2 - Temperature-sensitive scheduling
The scheduler, driven by HotSpot and the profile data, produces a temperature-sensitive schedule.

Phase 3 - Locality-based scheduling
The scheduler refines it into a temperature- and locality-sensitive schedule.

Phase 4 - Code generation
The code generator emits optimized, temperature-sensitive code using the Omega Library.
Experiments
Five loop-intensive codes were tested:

    Benchmark    Cycles (millions)    Energy (µJ)
    3step-log    1487                 1894686.2
    Adi          438                  1239551.1
    Btrix        135                  180918.1
    Eflux        56                   80918.1
    Tsf          1799                 2548001.6
adi - Threshold Temperature 88 °C
eflux - Threshold Temperature 88 °C
adi - Threshold Temperature 88 °C
eflux - Threshold Temperature 88 °C
Sensitivity Analysis: adi - Threshold Temperature 87 °C
Sensitivity Analysis: adi - Threshold Temperature 86 °C
Sensitivity Analysis: adi - Threshold Temperature 85 °C
Sensitivity Analysis: adi - Threshold Temperature 84 °C
Experiments
Experiments
Conclusion
- Implemented a compiler-directed scheduling algorithm that is both temperature sensitive and performance aware.
- Achieves impressive average and peak chip temperature reductions.
- This allows software to take up the burden of preventing chip damage due to thermal effects:
  - Chips can be aggressively scaled
  - Cooling costs can be reduced
  - Lowers the need for hardware-based thermal management schemes
Thank you!