The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops

Francis Dang, Hao Yu, and Lawrence Rauchwerger
Department of Computer Science, Texas A&M University
Parasol Laboratory. IPDPS 2002.
Motivation

- To maximize performance, extract the maximum available parallelism from loops.
- Static compiler methods may be insufficient:
  - Access patterns may be too complex.
  - Required information is only available at run time.
- Run-time methods are needed to extract loop parallelism:
  - Inspector/executor
  - Speculative parallelization
Speculative Parallelization: The LRPD Test

Main idea:
- Execute the loop as a DOALL.
- Record memory references during execution.
- Check for data dependences.
- If there was a dependence, re-execute the loop sequentially.

Disadvantages:
- A single data dependence invalidates the entire speculative parallelization.
- The slowdown is proportional to the speculative parallel execution time.
- Partial parallelism is not exploited.
Partially Parallel Loop Example

  do i = 1, 8
    z = A[K[i]]
    A[L[i]] = z + C[i]
  end do

  K[1:8] = [1,2,3,1,4,2,1,1]
  L[1:8] = [4,5,5,4,3,5,3,3]

Access pattern of A (R = read, W = write):

  A() \ iter   1  2  3  4  5  6  7  8
   1           R        R        R  R
   2              R           R
   3                 R     W     W  W
   4           W        W  R
   5              W  W        W
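As a minimal sketch of what the LRPD test records, the Python below (a stand-in for the original Fortran run-time code, not the paper's implementation) marks each element of A with the iterations that read and write it, then applies a deliberately conservative check: any element that is written and also accessed in a different iteration flags a potential dependence. The real test additionally recognizes privatizable and reduction accesses, which this sketch omits.

```python
def mark_accesses(K, L, n):
    """Shadow-mark A: record (iteration, 'R'/'W') per element of A."""
    marks = {}
    for i in range(1, n + 1):
        marks.setdefault(K[i - 1], []).append((i, 'R'))  # z = A[K[i]]
        marks.setdefault(L[i - 1], []).append((i, 'W'))  # A[L[i]] = z + C[i]
    return marks

def maybe_dependent(marks):
    """Conservative check: flag any element written in one iteration
    and accessed (read or written) in a different one."""
    for accesses in marks.values():
        writes = [i for i, kind in accesses if kind == 'W']
        if writes and any(i != w for i, _ in accesses for w in writes):
            return True
    return False

# The example above: A(4) is written in iterations 1 and 4 and read in
# iteration 5, so the loop cannot run as a plain DOALL.
K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 3, 5, 3, 3]
print(maybe_dependent(mark_accesses(K, L, 8)))  # True
```

Under the plain LRPD test, this single detected dependence forces a full sequential re-execution, which is the disadvantage the R-LRPD addresses.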
The Recursive LRPD Test

Main idea:
- Transform a partially parallel loop into a sequence of fully parallel, block-scheduled loops.
- Iterations before the first data dependence are correct and are committed.
- Re-apply the LRPD test to the remaining iterations.

Worst case:
- Sequential execution time plus the testing overhead.
Algorithm

1. Initialize the test data structures.
2. Checkpoint the arrays under test.
3. Execute the remaining iterations as a DOALL.
4. Analyze the recorded references.
5. On success: commit.
   On failure: restore from the checkpoint, reinitialize, and restart on the uncommitted iterations.
Implementation

Implemented as a run-time pass in Polaris, with additional hand-inserted code:
- Privatization with copy-in/copy-out for the arrays under test.
- Replicated buffers for reductions.
- Backup arrays for checkpointing.
Recursive LRPD Example

  do i = 1, 8
    z = A[K[i]]
    A[L[i]] = z + C[i]
  end do

  K[1:8] = [1,2,3,1,4,2,1,1]
  L[1:8] = [4,5,5,4,2,5,3,3]

First stage (block-scheduled, two iterations per processor):

  proc    P1    P2    P3    P4
  iter   1-2   3-4   5-6   7-8
  A(1)     R     R           R
  A(2)     R           W
  A(3)           R           W
  A(4)     W     W     R
  A(5)     W     W     W

A(4) is written on P1 and P2 but read first on P3, so the test fails at P3: iterations 1-4 are committed, and the test is re-applied to iterations 5-8.

Second stage (P3 and P4 keep their blocks):

  proc    P1    P2    P3    P4
  iter               5-6   7-8
  A(1)                       R
  A(2)                 W
  A(3)                       W
  A(4)                 R
  A(5)                 W

No cross-processor dependences remain, so the second stage succeeds.
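One stage of the test can be modeled in Python (a sketch, not the paper's implementation): each processor executes a contiguous block, a read is safe if the element was last written by the same processor (its privatized copy) or not written at all, and the stage fails at the first processor whose read needs a value written by an earlier processor.

```python
def first_failing_proc(K, L, blocks):
    """blocks: per-processor iteration lists, in loop order.
    Return the index of the first processor whose read touches an
    element written by an earlier processor (a cross-processor flow
    dependence), or None if the stage succeeds. Write-write and
    write-only sharing is assumed privatizable with copy-out."""
    written_earlier = set()
    for p, iters in enumerate(blocks):
        local_writes = set()
        for i in iters:
            a = K[i - 1]                        # read A[K[i]]
            if a in written_earlier and a not in local_writes:
                return p
            local_writes.add(L[i - 1])          # write A[L[i]]
        written_earlier |= local_writes
    return None

K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
# First stage: fails at P3 (0-indexed processor 2), so iters 1-4 commit.
print(first_failing_proc(K, L, [[1, 2], [3, 4], [5, 6], [7, 8]]))  # 2
# Second stage on the remaining iterations succeeds.
print(first_failing_proc(K, L, [[5, 6], [7, 8]]))  # None
```

Because the first processor of a stage can never conflict with an earlier one, each stage commits at least one block, which guarantees progress.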
Heuristics

- Work redistribution
- Sliding-window approach
- Data dependence graph extraction
Work Redistribution

- Redistribute the remaining iterations across all processors.
- The execution time of each stage decreases.

Disadvantages:
- May uncover new dependences across processors.
- May incur remote cache misses from data redistribution.

[Figure: stage-by-stage schedules on p1-p4, with and without redistribution.]
Work Redistribution Example

(Same loop as the Recursive LRPD example, with K[1:8] = [1,2,3,1,4,2,1,1] and L[1:8] = [4,5,5,4,2,5,3,3].)

First stage: identical to before; the test fails at P3 and iterations 1-4 commit.

Second stage (iterations 5-8 redistributed, one per processor):

  proc    P1   P2   P3   P4
  iter     5    6    7    8
  A(1)               R    R
  A(2)     W    R
  A(3)               W    W
  A(4)     R
  A(5)          W

A(2) is written on P1 (iteration 5) and read on P2 (iteration 6): redistribution has uncovered a new cross-processor dependence, so only iteration 5 commits.

Third stage (iterations 6-8 redistributed):

  proc    P1   P2   P3   P4
  iter     6    7    8
  A(1)          R    R
  A(2)     R
  A(3)          W    W
  A(4)
  A(5)     W

The writes to A(3) on P2 and P3 are write-only and thus privatizable; no flow dependences remain and the stage succeeds.
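The staged execution with redistribution can be sketched end to end (a Python stand-in: one iteration per processor after each failure, dependence detection at iteration granularity, and commit modeled by sequential re-execution rather than parallel execution with copy-out):

```python
def r_lrpd_redistributed(K, L, C, A, n):
    """Run A[L[i]] = A[K[i]] + C[i], i = 1..n, in speculative stages.
    Each stage commits the prefix of remaining iterations up to (but
    not including) the first flow dependence; returns the stage count."""
    start, stages = 1, 0
    while start <= n:
        written, dep = set(), None
        for i in range(start, n + 1):
            if K[i - 1] in written:   # reads an earlier iteration's write
                dep = i
                break
            written.add(L[i - 1])
        stop = n + 1 if dep is None else dep
        for i in range(start, stop):  # commit the dependence-free prefix
            A[L[i - 1]] = A[K[i - 1]] + C[i - 1]
        stages += 1
        start = stop                  # restart at the dependent iteration
    return stages

K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
C = list(range(1, 9))
A = [0] * 6                      # A(1)..A(5); index 0 unused
stages = r_lrpd_redistributed(K, L, C, A, 8)
print(stages)                    # 3, matching the example: 1-4, then 5, then 6-8
```

The first iteration of a stage can never be flagged, so the stage count is bounded by the number of flow dependences and the worst case degenerates to sequential execution plus testing overhead, as stated earlier.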
Redistribution Model

- Redistribution may not always be beneficial.
- Stop redistributing when the cost of data redistribution outweighs the benefit of work redistribution.
- A synthetic loop was used to model this adaptive method.
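The stopping rule can be written as a one-line predicate. The cost terms below (per-iteration data-movement cost, per-iteration work) are illustrative placeholders, not quantities from the paper's synthetic-loop model:

```python
def should_redistribute(remaining_iters, nproc, busy_procs,
                        move_cost_per_iter, work_per_iter):
    """Redistribute only if the time saved by spreading the remaining
    iterations over all nproc processors exceeds the cost of moving
    their data. All cost parameters are hypothetical."""
    time_without = (remaining_iters / busy_procs) * work_per_iter
    time_with = (remaining_iters / nproc) * work_per_iter \
                + remaining_iters * move_cost_per_iter
    return time_with < time_without

# Cheap data movement: redistribution wins.
print(should_redistribute(1000, 16, 2, 0.01, 1.0))  # True
# Expensive data movement: keep the current assignment.
print(should_redistribute(1000, 16, 2, 1.0, 1.0))   # False
```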
[Figure: redistribution-model results on the synthetic loop.]
Sliding Window R-LRPD

- The R-LRPD can degenerate to a sequential schedule for long dependence distributions.
- Strip-mine the speculative execution: apply the R-LRPD to a contiguous window of iterations at a time.
- Only dependences within the current window cause failures.
- Adds more global synchronizations and test overhead.

[Figure: sliding-window schedule on two processors across stages.]
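Strip-mining changes only the driver: each stage speculates on a fixed-size window of the remaining iterations instead of all of them, so a dependence late in the loop cannot fail work far ahead of it. A sketch under the same iteration-granularity assumptions as before:

```python
def sliding_window_r_lrpd(K, L, C, A, n, window):
    """Speculate on at most `window` iterations per stage; only flow
    dependences inside the window can cause a (partial) failure."""
    start, stages = 1, 0
    while start <= n:
        end = min(start + window - 1, n)
        written, dep = set(), None
        for i in range(start, end + 1):
            if K[i - 1] in written:   # flow dependence inside the window
                dep = i
                break
            written.add(L[i - 1])
        stop = end + 1 if dep is None else dep
        for i in range(start, stop):  # commit the dependence-free prefix
            A[L[i - 1]] = A[K[i - 1]] + C[i - 1]
        stages += 1
        start = stop
    return stages

K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
C = list(range(1, 9))
A = [0] * 6
stages = sliding_window_r_lrpd(K, L, C, A, 8, window=2)
print(stages)  # 5
```

With window = 2 on the earlier example this takes 5 stages: smaller windows bound the work lost to a restart at the price of one global synchronization per window, which is exactly the trade-off noted above.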
DDG Extraction

- The R-LRPD can generate sequential schedules for complex dependence distributions.
- Use the sliding-window R-LRPD scheme to extract the data dependence graph (DDG).
- Generate an optimized schedule from the DDG.
- Obtains the DDG for loops from which a proper inspector cannot be extracted.

[Figure: sliding-window stages and the extracted DDG edges.]
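The extraction step can be sketched as: record the last writer of every element as iterations retire, and add an edge writer-to-reader for each read that consumed another iteration's value. This Python stand-in runs over the example access pattern used earlier; the edges are iteration pairs for that example, not the slide's exact graph:

```python
def extract_ddg(K, L, n):
    """Flow-dependence edges (i, j): iteration j reads the value
    written by the last preceding writer i of that element."""
    last_writer = {}
    edges = set()
    for j in range(1, n + 1):
        a = K[j - 1]                  # read A[K[j]]
        if a in last_writer:
            edges.add((last_writer[a], j))
        last_writer[L[j - 1]] = j     # write A[L[j]]
    return edges

K = [1, 2, 3, 1, 4, 2, 1, 1]
L = [4, 5, 5, 4, 2, 5, 3, 3]
print(sorted(extract_ddg(K, L, 8)))  # [(4, 5), (5, 6)]
```

A scheduler can then run all iterations with no unsatisfied incoming edges in parallel, instead of the strictly staged prefix-commit schedule.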
Performance Issues

- Blocked scheduling is a potential cause of load imbalance.
- Checkpointing can be expensive.

Feedback-guided blocked scheduling:
- Use the timing information from the previous instantiation (Bull, EuroPar '98).
- Estimate the per-processor chunk sizes that minimize load imbalance.

On-demand checkpointing:
- Checkpoint only the data modified during execution.
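The feedback-guided idea can be sketched as: take each processor's measured time from the previous instantiation, infer its effective speed, and size the next chunks proportionally so all processors finish together. The numbers below are hypothetical; the actual scheme follows Bull (EuroPar '98):

```python
def feedback_chunks(prev_chunk_sizes, prev_times, n_iters):
    """Chunk sizes for the next instantiation, proportional to each
    processor's iterations-per-second in the previous one."""
    rates = [c / t for c, t in zip(prev_chunk_sizes, prev_times)]
    total = sum(rates)
    chunks = [round(n_iters * r / total) for r in rates]
    chunks[-1] += n_iters - sum(chunks)   # absorb rounding error
    return chunks

# Processor 2 was twice as slow last time, so it gets a smaller chunk.
print(feedback_chunks([25, 25, 25, 25], [1.0, 2.0, 1.0, 1.0], 100))
```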
Experiments

Setup:
- 16-processor HP V-Class
- 4 GB memory
- HP-UX 11.0

Codes and loops:

  Code        Loops
  TRACK       NLFILT_do300, EXTEND_do400, FPTRAK_do300
  SPICE 2G6   DCDCMP_do15, DCDCMP_do70, BJT
  FMA3D       Quadrilateral Loop
Experimental Results

[Figures: input profiles; TRACK; sliding window; FMA3D; SPICE 2G6.]
Conclusion

Contributions:
- Any loop can now be speculatively parallelized.
- The concern becomes how best to optimize the parallelization, not whether to parallelize.

Future work:
- Use dependence distribution information for adaptive redistribution and scheduling.