Published by Tamsyn Horton. Modified over 9 years ago.
1
Carnegie Mellon
Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
† Dept. Elec. & Comp. Engineering, University of Toronto
2
Zhai, Colohan, Steffan and Mowry Carnegie Mellon Compiler Optimization of Memory-Resident Value Communication… - 2 -
Motivation
Chip-level multiprocessing is becoming commonplace:
  UltraSPARC IV (2 UltraSPARC III cores)
  IBM Power 4
  Sun MAJC
  SiByte SB-1250
We need parallel programs.
Can multithreaded processors improve the performance of a single application?
3
Why Is Automatic Parallelization Difficult?
Automatic parallelization today:
  Must statically prove threads are independent
  Constructing proofs is difficult due to ambiguous data dependences:
    Complex control flow
    Pointers and indirect references
    Runtime inputs
Optimistic compiler?
  Limited only by true dependences
One solution: Thread-Level Speculation
4
Example
    while (...) {
        …
        x = hash[index1];
        …
        hash[index2] = y;
        ...
    }
Speculative threads over time on Processors 1 to 4; each thread ends with check_dep():
  Thread 1: … = hash[3];  hash[10] = ...
  Thread 2: … = hash[19]; hash[21] = ...
  Thread 3: … = hash[33]; hash[30] = ...
  Thread 4: … = hash[10]; hash[25] = ...
  Thread 5: … = hash[31]; hash[12] = ...
  Thread 6: … = hash[9];  hash[44] = ...
  Thread 7: … = hash[27]; hash[32] = ...
Thread 4 loads hash[10], which Thread 1 stores to, so its check_dep() fails and Thread 4 is retried.
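The dependence check in this example can be sketched in a few lines of C. This is only an illustration, not the hardware mechanism from the talk: the names `struct epoch`, `first_violation` and `slide_batch_violation` are hypothetical, and the sketch pessimistically assumes that every earlier thread's store lands after the later thread's load, so any read-after-write match between threads in the same batch counts as a violation.

```c
#include <assert.h>
#include <stddef.h>

/* One speculative thread (one loop iteration): it loads hash[read_idx]
 * and later stores hash[write_idx]. Names are illustrative only. */
struct epoch {
    int read_idx;
    int write_idx;
};

/* Returns the position of the first thread in the batch that must be
 * retried, or -1 if none. A thread is violated when it loaded a location
 * that a logically earlier thread in the same batch stores to
 * (ASSUMPTION: the earlier store always arrives after the later load). */
int first_violation(const struct epoch *batch, size_t n)
{
    assert(batch != NULL || n == 0);
    for (size_t later = 1; later < n; later++) {
        for (size_t earlier = 0; earlier < later; earlier++) {
            if (batch[later].read_idx == batch[earlier].write_idx)
                return (int)later;
        }
    }
    return -1;
}

/* Threads 1 to 4 from the slide, in logical order. */
int slide_batch_violation(void)
{
    static const struct epoch batch[] = {
        { 3, 10 }, { 19, 21 }, { 33, 30 }, { 10, 25 },
    };
    return first_violation(batch, sizeof batch / sizeof batch[0]);
}
```

With the slide's indices, `slide_batch_violation()` reports position 3, that is, Thread 4, whose load of hash[10] matches Thread 1's store.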
5
Frequently Dependent Scalars
Producer (earlier in time): … = a; a = …
Consumer: … = a; a = …
Can identify scalars that always cause dependences
6
Frequently Dependent Scalars
Producer (earlier in time): … = a; a = …; Signal(a)
Consumer: Wait(a); … = a; a = …
Dependent scalars should be synchronized [ASPLOS’02]
7
Frequently Dependent Scalars
Producer (earlier in time): … = a; a = …
Consumer: … = a; a = …
Dataflow analysis allows us to deal with complex control flow [ASPLOS’02]
8
Communicating Memory-Resident Values
Producer (earlier in time): Load *p; Store *q
Consumer: Load *p; Store *q
Synchronize? Speculate?
Will speculation succeed?
9
Speculation vs. Synchronization
[Diagram: timelines of Load *p / Store *q pairs under Sequential Execution and under Speculative Parallel Execution]
Speculation succeeds: efficient
10
Speculation vs. Synchronization
[Diagram: timelines of Load *p / Store *q pairs under Sequential Execution and under Speculative Parallel Execution, with a violation marked]
Speculation fails: inefficient
11
Speculation vs. Synchronization
[Diagram: timelines of Load *p / Store *q pairs under Sequential Execution and under Speculative Parallel Execution]
Frequent dependences: Synchronize
Infrequent dependences: Speculate
12
Performance Potential
Detailed simulation:
  TLS support
  4-processor CMP
  4-way issue, out-of-order superscalar
  10-cycle communication latency
[Chart: normalized regional execution time (0 to 100), Original vs. Perfect memory value prediction, for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, go]
Reducing failed speculation improves performance
13
Hardware vs. Compiler Inserted Synchronization
Three timelines, each with a Producer (Store *q), a Consumer (Load *p) and Memory:
  Speculation
  Hardware-inserted Synchronization [HPCA’02]: the consumer's load is marked (stall)
  Compiler-inserted Synchronization [CGO’04]: the producer calls Signal(), the consumer calls Wait()
14
Issues in Synchronizing Memory-Resident Values
Static analysis:
  Which instructions to synchronize?
  Inter-procedural dependences
Runtime:
  Detecting and recovering from improper synchronization
[Diagram: Producer Store *q, Consumer Load *p, over time]
15
Outline
  Static analysis
  Runtime checks
  Results
  Conclusions
[Diagram: Producer Store *q, Consumer Load *p, over time]
16
Compiler Passes
foo.c, Front End, Back End, foo.exe
Passes shown on the slide:
  Insert Synchronization
  Profile Data Dependences
  Create Threads
  Schedule Instructions
  Decide what to Synchronize
17
Example
    do {
        push(&set, element);
        work();
    } while (test);
Functions called: work(), push(head, entry)
18
Example
    do {
        push(&set, element);
        work();
    } while (test);
    work() {
        if (condition(&set))
            push(&set, element);
    }
Functions called: push(head, entry)
19
Example
    do {
        push(&set, element);
        work();
    } while (test);
    work() {
        if (condition(&set))
            push(&set, element);
    }
    push(head, entry) {
        entry->next = *head;
        *head = entry;
    }
Memory accesses, labeled with their call paths:
  Load *head (push)
  Store *head (push)
  Load *head (work, push)
  Store *head (work, push)
20
Compiler Passes
foo.c, Front End, Back End, foo.exe
Passes shown on the slide:
  Insert Synchronization
  Profile Data Dependences
  Thread Creation
  Instruction Scheduling
  Decide what to Synchronize
21
Example
    do {
        push(&set, element);
        work();
    } while (test);
    work() {
        if (condition(&set))
            push(&set, element);
    }
    push(head, entry) {
        entry->next = *head;
        *head = entry;
    }
Memory accesses, labeled with their call paths:
  Load *head (push)
  Store *head (push)
  Load *head (work, push)
  Store *head (work, push)
Profile Information
========================================================
Source                      Destination                Frequency
Store *head (push)          Load *head (push)          990
Store *head (push)          Load *head (work, push)    10
Store *head (work, push)    Load *head (push)          10
22
Compiler Passes
foo.c, Front End, Back End, foo.exe
Passes shown on the slide:
  Insert Synchronization
  Profile Data Dependences
  Thread Creation
  Instruction Scheduling
  Decide what to Synchronize
23
Dependence Graph
Nodes:
  Load *head (push)
  Store *head (push)
  Load *head (work, push)
  Store *head (work, push)
Edge weights shown: 990, 10
Pairs that need to be synchronized can be extracted from the dependence graph
Infrequent dependences: occur in less than 5% of iterations
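A minimal sketch of how the 5% rule could be applied to the profiled edges follows. It is not the authors' implementation: the struct and function names are hypothetical, and the total of 1000 profiled iterations is an assumption, since the slides give only the edge frequencies (990, 10, 10).

```c
#include <assert.h>
#include <stddef.h>

/* One edge of the profiled dependence graph. */
struct dep_edge {
    const char *store;   /* source store, with its call path */
    const char *load;    /* destination load, with its call path */
    unsigned count;      /* iterations in which the dependence occurred */
};

/* An edge is infrequent when it occurs in less than 5% of iterations,
 * so it is frequent when count / iterations >= 5 / 100.
 * Integer arithmetic avoids rounding trouble at the boundary. */
int is_frequent(unsigned count, unsigned iterations)
{
    assert(iterations > 0);
    return (unsigned long long)count * 100u >=
           (unsigned long long)iterations * 5u;
}

/* Counts the edges that would be synchronized rather than speculated on. */
size_t count_frequent(const struct dep_edge *edges, size_t n,
                      unsigned iterations)
{
    size_t frequent = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_frequent(edges[i].count, iterations))
            frequent++;
    }
    return frequent;
}

/* The profile from the slides, ASSUMING 1000 profiled iterations. */
size_t slide_profile_frequent_edges(void)
{
    static const struct dep_edge profile[] = {
        { "Store *head (push)",       "Load *head (push)",       990 },
        { "Store *head (push)",       "Load *head (work, push)",  10 },
        { "Store *head (work, push)", "Load *head (push)",        10 },
    };
    return count_frequent(profile, sizeof profile / sizeof profile[0], 1000u);
}
```

Under that assumption only the 990-count edge, Store *head (push) to Load *head (push), clears the threshold; the two 10-count edges stay speculative.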
24
Compiler Passes
foo.c, Front End, Back End, foo.exe
Passes shown on the slide:
  Insert Synchronization
  Profile Data Dependences
  Thread Creation
  Instruction Scheduling
  Decide what to Synchronize
25
Example
    do {
        push(&set, element);
        work();
    } while (test);
    work() {
        if (condition(&set))
            push(&set, element);
    }
    push(head, entry) {
        entry->next = *head;
        *head = entry;
    }
Synchronize these (frequency 990): Store *head (push) and Load *head (push)
    push_clone(head, entry) {
        wait();
        entry->next = *head;
        *head = entry;
        signal(head, *head);
    }
Call to the synchronized clone:
    push_clone(&set, element);
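The cloned function can be made runnable as a single-threaded sketch. Everything beyond the body of push_clone is an assumption for illustration: `tls_wait` and `tls_signal` stand in for the slide's wait() and signal() (renamed to avoid clashing with the C library), the `struct channel` forwarding buffer is hypothetical, and since nothing runs in parallel here the forwarded value always equals the value in memory.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
};

/* Hypothetical forwarding channel from the previous thread to this one. */
struct channel {
    int ready;            /* set once the producer has signaled */
    struct node **addr;   /* forwarded address (head) */
    struct node *value;   /* forwarded value (*head) */
};

static struct channel chan;

/* Stand-in for the slide's signal(head, *head). */
void tls_signal(struct node **addr, struct node *value)
{
    chan.addr = addr;
    chan.value = value;
    chan.ready = 1;
}

/* Stand-in for the slide's wait(). Real TLS support would stall the
 * consumer here; this sequential sketch only reports whether a value
 * has been forwarded yet. */
int tls_wait(void)
{
    return chan.ready;
}

/* Synchronized clone of push(): wait, do the push, forward the new head. */
void push_clone(struct node **head, struct node *entry)
{
    assert(head != NULL && entry != NULL);
    struct node *old_head = *head;
    if (tls_wait() && chan.addr == head)
        old_head = chan.value;   /* take the forwarded *head */
    entry->next = old_head;
    *head = entry;
    tls_signal(head, *head);
}

/* Runs three "iterations" back to back and returns the list length. */
int demo_push_clone(void)
{
    static struct node nodes[3];
    struct node *set = NULL;
    int length = 0;

    chan.ready = 0;
    for (int i = 0; i < 3; i++)
        push_clone(&set, &nodes[i]);
    for (struct node *n = set; n != NULL; n = n->next)
        length++;
    return length;
}
```

The point of the sketch is the placement of the calls: the wait precedes the load of *head and the signal follows the store, so the frequent Store *head (push) to Load *head (push) pair is ordered instead of speculated on.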
26
Outline
  Static analysis
  Runtime checks
  Results
  Conclusions
[Diagram: Producer Store *q, Consumer Load *p, over time]
27
Runtime Checks
Conditions to check:
  Store *q and Load *p access the same memory address
  No store modifies the forwarded address between Store *q and Load *p
Signal(q, *q): the producer forwards the address to ensure a match between the load and the store
[Diagram: Producer Store *q, Consumer Load *p, over time]
28
Ensuring Correctness
Conditions to check:
  Store *q and Load *p access the same memory address
  No store modifies the forwarded address between Store *q and Load *p
Hardware support:
  Similar to memory conflict buffer [Gallagher et al., ASPLOS’94]
[Diagram: Producer Store *q, Consumer Load *p, and an additional Store *x, over time]
29
Ensuring Correctness
Conditions to check:
  Store *q and Load *p access the same memory address
  No store modifies the forwarded address between Store *q and Load *p
Hardware support: TLS hardware already knows which locations are stored to
[Diagram: Producer Store *q, Consumer Load *p, and an additional Store *y, over time]
30
Outline
  Static analysis
  Runtime checks
  Results
  Conclusions
[Diagram: Producer Store *q, Consumer Load *p, over time]
31
Experimental Framework
Underlying architecture:
  4-processor, single-chip multiprocessor
  Speculation supported through coherence
Simulator:
  Superscalar, similar to MIPS R14K
  10-cycle communication latency
  Models all bandwidth and contention
Benchmarks:
  SPECint95 and SPECint2000, -O3 optimization
Detailed simulation
[Diagram: blocks labeled P and C connected by a crossbar]
32
Parallel Region Coverage
[Chart: coverage (0 to 100) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]
Coverage is significant
Average coverage: 54%
33
Compiler-Inserted Synchronization
[Chart: normalized regional execution time (0 to 100) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp; bars U and C per benchmark, each broken down into Failed Speculation, Synchronization Stall, Other and Busy]
U = No synchronization inserted
C = Compiler-inserted synchronization
Speedup labels on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%
Seven benchmarks speed up by 5% to 46%
34
Compiler- vs. Hardware-Inserted Synchronization
[Chart: normalized regional execution time (0 to 100) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp; bars C and H per benchmark, each broken down into Failed Speculation, Synchronization Stall, Other and Busy; groups of benchmarks are annotated "Hardware does better" and "Compiler does better"]
C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
Compiler and hardware [HPCA’02] each benefit different benchmarks
35
Combining Hardware and Compiler Synchronization
[Chart: normalized regional execution time (0 to 100) for go, m88ksim, gzip_comp, gzip_decomp, perlbmk, gap; bars C, H and B per benchmark, each broken down into Failed Speculation, Synchronization Stall, Other and Busy]
C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
B = Combining both
The combination is more robust than each technique individually
36
Related Work
Compiler-inserted:
  Zhai et al. CGO’04
  Cytron ICPP’86
Hardware-inserted:
  Moshovos et al. ISCA’97
  Cintra & Torrellas HPCA’02
  Steffan et al. HPCA’02
Other labels on the slide: Centralized Table, Distributed Table, Tsai & Yew PACT’96
37
Conclusions
Compiler-inserted synchronization for memory-resident value communication:
  Effective in reducing speculation failure
  Half of the benchmarks speed up by 5% to 46% (regional)
Combining hardware and compiler techniques is more robust:
  Neither consistently outperforms the other
  Can be combined to track the best performer
Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware
38
Questions?
39
The Potential of Instruction Scheduling
[Chart: scale 0 to 100 for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, mcf, crafty, parser, perlbmk, gap, gzip_comp, gcc, bzip2_comp; bars E, C and L per benchmark, each broken down into Failed Speculation, Synchronization Stall, Other and Busy]
E = Early
C = Compiler-inserted synchronization
L = Late
Scheduling instructions has additional benefit for some benchmarks
40
Program Performance
[Chart: scale 0 to 100 for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, twolf, gzip_comp; bars U, C, H and B per benchmark, each broken down into Failed Speculation, Synchronization Stall, Other and Busy]
U = Un-optimized
C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
B = Both compiler and hardware
41
Which Technique Synchronizes This Load?
[Chart: scale 0 to 100 for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, twolf, gzip_comp; bars U, C, H and B per benchmark, each broken down into: Synchronized by neither technique, Synchronized by compiler, Synchronized by hardware, Synchronized by both]
U = Un-optimized
C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
B = Both compiler and hardware
42
Ensuring Correctness
Conditions to check:
  Store *q and Load *p access the same memory address
  No store modifies the forwarded address between Store *q and Load *p
Hardware support:
  Similar to memory conflict buffer [Gallagher et al., ASPLOS’94]
[Diagram: Producer Store *q, Consumer Load *p, and an additional Store *x]
43
Ensuring Correctness
Conditions to check:
  Store *q and Load *p access the same memory address
  No store modifies the forwarded address between Store *q and Load *p
Hardware support:
  Use the forwarded value only if the synchronized pair is dependent
Producer: Store *q; Signal(q); Signal(*q); Store *x
Consumer: Load *p
[Flowchart at the consumer: decisions "Local Store to *p" and "q == p" (edges labeled NO, YES, NO), leading to "Use Forwarded Value" or "Use Memory Value"]
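One reading of this decision flow can be written as a small C function. It is a software sketch of a check the slides attribute to hardware, and the details are assumptions: `struct forwarded`, `checked_load` and the `valid` flag (cleared if a later producer store touches the forwarded address) are hypothetical names, and the sketch assumes that a local store to *p by the consumer makes the memory value the one to use.

```c
#include <assert.h>
#include <stddef.h>

/* What the producer forwarded with Signal(q) and Signal(*q). */
struct forwarded {
    const int *addr;   /* q */
    int value;         /* *q at the time of the signal */
    int valid;         /* ASSUMPTION: cleared if the producer later stores
                          to the forwarded address again */
};

/* Chooses the value for the consumer's Load *p. The forwarded value is
 * used only when the pair is really dependent (q == p), it is still
 * valid, and the consumer has not itself stored to *p. */
int checked_load(const int *p, const struct forwarded *fwd,
                 int local_store_to_p)
{
    assert(p != NULL && fwd != NULL);
    if (fwd->valid && fwd->addr == p && !local_store_to_p)
        return fwd->value;   /* use forwarded value */
    return *p;               /* use memory value */
}

/* q == p and no local store: the forwarded value (42) wins over memory (1). */
int demo_dependent_pair(void)
{
    int x = 1;
    struct forwarded fwd = { &x, 42, 1 };
    return checked_load(&x, &fwd, 0);
}

/* q != p: the forwarded value is ignored and memory (7) is read. */
int demo_independent_pair(void)
{
    int x = 1, y = 7;
    struct forwarded fwd = { &x, 42, 1 };
    return checked_load(&y, &fwd, 0);
}

/* The consumer stored 5 to *p first: memory is read despite q == p. */
int demo_local_store(void)
{
    int x = 5;
    struct forwarded fwd = { &x, 42, 1 };
    return checked_load(&x, &fwd, 1);
}
```

The three demo functions correspond to the three outcomes of the flowchart as interpreted here: forwarded value for a matching pair, memory value for a mismatched address, and memory value after a local store.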
44
Issues in Synchronizing Memory-Resident Values
  Inserting synchronization using compilers
  Ensuring correctness
  Reducing synchronization cost
[Diagram: Producer Store *q, Consumer Load *p]
45
Reducing Cost of Synchronization
[Diagram: Producer and Consumer timelines, Before Instruction Scheduling and After Instruction Scheduling]
Instruction scheduling algorithms are described in [ASPLOS’02]
46
The Potential of Instruction Scheduling
[Chart: normalized regional execution time (0 to 100) for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gap; bars E, C and L per benchmark, each broken down into Failed Speculation, Synchronization Stall, Other and Busy]
E = Perfectly predicting synchronized memory-resident values
C = Compiler-inserted synchronization
L = Consumer stalls until previous thread commits
Scheduling instructions could offer additional benefit
47
Using More Accurate Profiling Information
[Chart: normalized regional execution time (0 to 100) for gzip_comp; bars C, R and U, each broken down into Failed Speculation, Synchronization Stall, Other and Busy]
U = No instruction scheduling
C = Compiler-inserted synchronization
R = Compiler-inserted synchronization (profiled with the ref input set)
gzip_comp is the only benchmark sensitive to profiling input