EECS 583 – Class 17
Research Topic 1: Decoupled Software Pipelining
University of Michigan
November 9, 2011
- 1 - Announcements + Reading Material
v 2nd paper review due today
  » Should have submitted to andrew.eecs.umich.edu:/y/submit
v Next Monday: midterm exam in class
v Today's class reading
  » "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.
v Next class reading (Wednesday, Nov 16)
  » "Spice: Speculative Parallel Iteration Chunk Execution," E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc. 2008 Intl. Symposium on Code Generation and Optimization, April 2008.
- 2 - Midterm Exam
v When: Monday, Nov 14, 2011, 10:40-12:30
v Where
  » 1005 EECS: uniquenames starting with A-H go here
  » 3150 Dow (our classroom): uniquenames starting with I-Z go here
v What to expect
  » Open book/notes, no laptops
  » Apply techniques we discussed in class on examples
  » Reason about solving compiler problems, i.e., why things are done
  » A couple of thinking problems
  » No LLVM code
  » Reasonably long, but you should finish
v The last 2 years' exams are posted on the course website
  » Note: past exams may not accurately predict future exams!!
- 3 - Midterm Exam
v Office hours between now and Monday if you have questions
  » Daya: Thurs and Fri, 3-5pm
  » Scott: Wednesday 4:30-5:30, Fri 4:30-5:30
v Studying
  » Yes, you should study even though it's open notes
    Lots of material that you have likely forgotten
    Refresh your memories
    No memorization required, but you need to be familiar with the material to finish the exam
  » Go through the lecture notes, especially the examples!
  » If you are confused on a topic, go through the reading
  » If still confused, come talk to me or Daya
  » Go through the practice exams as the final step
- 4 - Exam Topics
v Control flow analysis
  » Control flow graphs, dom/pdom, loop detection
  » Trace selection, superblocks
v Predicated execution
  » Control dependence analysis, if-conversion, hyperblocks
  » Can ignore control height reduction
v Dataflow analysis
  » Liveness, reaching defs, DU/UD chains, available defs/exprs
  » Static single assignment
v Optimizations
  » Classical: dead code elim, constant/copy prop, CSE, LICM, induction variable strength reduction
  » ILP optimizations: unrolling, renaming, tree height reduction, induction/accumulator expansion
  » Speculative optimization, like HW 1
- 5 - Exam Topics, Continued
v Acyclic scheduling
  » Dependence graphs, Estart/Lstart/Slack, list scheduling
  » Code motion across branches, speculation, exceptions
v Software pipelining
  » DSA form, ResMII, RecMII, modulo scheduling
  » Make sure you can modulo schedule a loop!
  » Execution control with LC, ESC
v Register allocation
  » Live ranges, graph coloring
v Research topics
  » Can ignore these
- 6 - Last Class
v Scientific codes: successful parallelization
  » KAP, SUIF, Parascope, gcc w/ Graphite
  » Affine array dependence analysis
  » DOALL parallelization
v C programs
  » Not dominated by array accesses, so classic parallelization fails
  » Speculative parallelization: Hydra, Stampede, speculative multithreading
    Profiling to identify statistical DOALL loops
    But not all loops are DOALL, and outer loops typically are not!!
v This class: parallelizing loops with dependences
- 7 - What About Non-Scientific Codes???

Scientific codes (FORTRAN-like):
  for(i=1; i<=N; i++)        // C
    a[i] = a[i] + 1;         // X
Independent Multithreading (IMT); example: DOALL parallelization
[Figure: two-core timeline; iterations alternate between Core 1 (C:1/X:1, C:3/X:3, C:5/X:5) and Core 2 (C:2/X:2, C:4/X:4, C:6/X:6) with no cross-core communication]

General-purpose codes (legacy C/C++):
  while(ptr = ptr->next)     // LD
    ptr->val = ptr->val + 1; // X
Cyclic Multithreading (CMT); example: DOACROSS [Cytron, ICPP 86]
[Figure: two-core timeline; iterations alternate between Core 1 (LD:1/X:1, LD:3/X:3, LD:5/X:5) and Core 2 (LD:2/X:2, LD:4/X:4, LD:6)]
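To make the IMT/DOALL idea concrete, here is a minimal sketch (my own illustration, not from the slides or the paper) of how the array loop could be split across cores by giving each core every other iteration. The worker signature and the surrounding thread setup are assumptions for the sketch.

  /* Sketch only: a cyclic (round-robin) DOALL split of the array loop.
     Iterations are independent, so each core runs its share with no
     cross-core communication.  `a`, `N`, and the thread setup are assumed. */
  void doall_worker(int core_id, int num_cores, int *a, int N) {
      for (int i = 1 + core_id; i <= N; i += num_cores)   /* C */
          a[i] = a[i] + 1;                                 /* X */
  }
  /* The list loop cannot be split this way: each iteration's LD needs the
     result of the previous iteration's LD, so the same split would have to
     forward `ptr` between cores every iteration, which is exactly what
     DOACROSS/CMT does. */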
- 8 - Alternative Parallelization Approaches

  while(ptr = ptr->next)     // LD
    ptr->val = ptr->val + 1; // X

Cyclic Multithreading (CMT)
[Figure: two-core timeline; iterations alternate between cores, so each LD waits on the LD of the previous iteration from the other core]

Pipelined Multithreading (PMT); example: DSWP [PACT 2004]
[Figure: two-core timeline; Core 1 runs all LDs (LD:1, LD:2, ...) while Core 2 runs all Xs (X:1, X:2, ...), one iteration behind]
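A minimal PMT-style sketch of this loop (my own illustration, not the compiler's generated code): the LD stage walks the list on one thread and forwards each node through a small FIFO to the X stage on another thread. The produce/consume names, the bounded pthread queue, and the NULL end-of-list marker are stand-ins for the inter-core communication queues on the slides.

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct node { int val; struct node *next; } node;

  #define QSIZE 8
  static node *fifo[QSIZE];
  static int qhead, qtail, qcount;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
  static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

  static void produce(node *p) {                 /* blocking enqueue */
      pthread_mutex_lock(&lock);
      while (qcount == QSIZE) pthread_cond_wait(&not_full, &lock);
      fifo[qtail] = p; qtail = (qtail + 1) % QSIZE; qcount++;
      pthread_cond_signal(&not_empty);
      pthread_mutex_unlock(&lock);
  }

  static node *consume(void) {                   /* blocking dequeue */
      pthread_mutex_lock(&lock);
      while (qcount == 0) pthread_cond_wait(&not_empty, &lock);
      node *p = fifo[qhead]; qhead = (qhead + 1) % QSIZE; qcount--;
      pthread_cond_signal(&not_full);
      pthread_mutex_unlock(&lock);
      return p;
  }

  static void *ld_stage(void *arg) {             /* Thread 1: LD */
      node *ptr = arg;
      while ((ptr = ptr->next))
          produce(ptr);
      produce(NULL);                             /* end-of-loop marker */
      return NULL;
  }

  static void *x_stage(void *arg) {              /* Thread 2: X */
      (void)arg;
      node *ptr;
      while ((ptr = consume()))
          ptr->val = ptr->val + 1;
      return NULL;
  }

  int main(void) {
      /* Build a small list behind a dummy head, mirroring `while(ptr = ptr->next)`. */
      node head_node = { 0, NULL }, *tailp = &head_node;
      for (int i = 0; i < 10; i++) {
          node *n = malloc(sizeof *n);
          n->val = i; n->next = NULL;
          tailp->next = n; tailp = n;
      }
      pthread_t t1, t2;
      pthread_create(&t1, NULL, ld_stage, &head_node);
      pthread_create(&t2, NULL, x_stage, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      for (node *p = head_node.next; p; p = p->next) printf("%d ", p->val);
      printf("\n");
      return 0;
  }

The point of the pipeline shape is that inter-thread communication latency only delays the start of the consumer stage; it is not paid on every iteration's critical path.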
- 9 - Comparison: IMT, PMT, CMT
[Figure: the three two-core schedules side by side; IMT alternates whole iterations (C:i/X:i) across cores, CMT alternates LD:i/X:i across cores, and PMT runs all LDs on Core 1 and all Xs on Core 2]
- 10 - Comparison: IMT, PMT, CMT
lat(comm) = 1: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 1 iter/cycle
[Figure: with single-cycle communication latency, all three schedules complete one iteration per cycle]
- 11 - Comparison: IMT, PMT, CMT
lat(comm) = 1: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 1 iter/cycle
lat(comm) = 2: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 0.5 iter/cycle
[Figure: with two-cycle communication latency, the CMT schedule stalls between iterations while IMT and PMT still complete one iteration per cycle]
- 12 - Comparison: IMT, PMT, CMT
[Figure: the same three two-core schedules, annotated:]
  Cross-thread dependences: wide applicability
  Thread-local recurrences: fast execution
- 13 - Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP
Example: 197.parser
[Figure: Decoupled Software Pipelining stages: Find English Sentences, Parse Sentences (95%), Emit Results; the Parse Sentences stage becomes a PS-DSWP (speculative DOALL) middle stage]
Decoupled Software Pipelining
- 15 - Decoupled Software Pipelining (DSWP)

  A: while(node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

Inter-thread communication latency is a one-time cost.
[MICRO 2005]
[Figure: dependence graph over A, B, C, D and its DAG SCC; the {A, D} recurrence goes to Thread 1 and {B}, {C} go to Thread 2; edge legend: intra-iteration, loop-carried, register, control; inter-thread values flow through a communication queue]
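The DAG SCC in the figure is the condensation of the dependence graph: each strongly connected component becomes one node, and the result is acyclic, so it can be cut into pipeline stages. A minimal sketch of that step (my own illustration, not the compiler's implementation), using Tarjan's SCC algorithm on the dependence edges as I read them off this slide:

  /* Sketch only: build the DAG of strongly connected components (the
     slides' "DAG SCC") for the A/B/C/D loop above using Tarjan's algorithm. */
  #include <stdio.h>

  #define MAXN 64
  static int n;                      /* number of dependence-graph nodes   */
  static int adj[MAXN][MAXN];        /* adj[u][v] = 1 if dependence u -> v */
  static int idx[MAXN], low[MAXN], oncur[MAXN], comp[MAXN];
  static int stk[MAXN], sp, timer, ncomp;

  static void tarjan(int u) {
      idx[u] = low[u] = ++timer;
      stk[sp++] = u; oncur[u] = 1;
      for (int v = 0; v < n; v++) {
          if (!adj[u][v]) continue;
          if (!idx[v]) { tarjan(v); if (low[v] < low[u]) low[u] = low[v]; }
          else if (oncur[v] && idx[v] < low[u]) low[u] = idx[v];
      }
      if (low[u] == idx[u]) {        /* u is the root of an SCC */
          int v;
          do { v = stk[--sp]; oncur[v] = 0; comp[v] = ncomp; } while (v != u);
          ncomp++;
      }
  }

  int main(void) {
      /* Dependence edges of the A/B/C/D loop as I read them off the figure:
         A->B, A->C, A->D (control), B->C (ncost), C->C and D->D (loop-carried),
         D->A and D->B (node). */
      n = 4;
      const char *name = "ABCD";
      int edges[][2] = {{0,1},{0,2},{0,3},{1,2},{2,2},{3,3},{3,0},{3,1}};
      for (size_t i = 0; i < sizeof edges / sizeof edges[0]; i++)
          adj[edges[i][0]][edges[i][1]] = 1;
      for (int u = 0; u < n; u++) if (!idx[u]) tarjan(u);
      for (int u = 0; u < n; u++)
          printf("%c -> SCC %d\n", name[u], comp[u]);
      /* Expected: A and D share a component (the pointer-chasing recurrence),
         while B and C each stand alone, matching the 3-node DAG SCC above. */
      return 0;
  }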
- 16 - Implementing DSWP
[Figure: code generation for DSWP showing the loop L1, the auxiliary code Aux, and the DFG; edge legend: intra-iteration, loop-carried, register, memory, control]
- 17 - Optimization: Node Splitting to Eliminate Cross-Thread Control
[Figure: loops L1 and L2 after node splitting; edge legend: intra-iteration, loop-carried, register, memory, control]
- 18 - Optimization: Node Splitting to Reduce Communication
[Figure: loops L1 and L2 after node splitting; edge legend: intra-iteration, loop-carried, register, memory, control]
- 19 - Constraint: Strongly Connected Components
Consider: [Figure: a dependence recurrence split across the proposed threads; edge legend: intra-iteration, loop-carried, register, memory, control]
Splitting a recurrence across threads eliminates the pipelined/decoupled property.
Solution: DAG SCC, i.e., partition at the granularity of strongly connected components.
- 20 - 2 Extensions to the Basic Transformation
v Speculation
  » Break statistically unlikely dependences
  » Form better-balanced pipelines
v Parallel Stages
  » Execute multiple copies of certain "large" stages
  » Stages that contain inner loops are perfect candidates
- 21 - Why Speculation?

  A: while(node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: dependence graph over A, B, C, D and its DAG SCC; edge legend: intra-iteration, loop-carried, register, control; communication queue]
- 22 - Why Speculation?

  A: while(cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: dependence graph and DAG SCC; with the early-exit condition, A, B, C, and D collapse into a single SCC, and the highlighted edges are the predictable dependences]
- 23 - Why Speculation?

  A: while(cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: speculating the predictable dependences breaks the large SCC apart, so D, B, C, and A each become their own node in the DAG SCC]
- 24 - Execution Paradigm
[Figure: DAG SCC nodes D, B, C, A running as pipeline stages; when misspeculation is detected on iteration 4, misspeculation recovery reruns iteration 4; edge legend: intra-iteration, loop-carried, register, control; communication queue]
- 25 - Understanding PMT Performance
[Figure: two two-core schedules; in the first, each stage takes 1 cycle per iteration (slowest thread: 1 cycle/iter, iteration rate 1 iter/cycle); in the second, the slowest thread needs 2 cycles per iteration, leaving idle time on the other core (2 cycle/iter, 0.5 iter/cycle)]
1. The per-iteration time of each thread is at least as long as its longest dependence recurrence.
2. Finding the longest recurrence is NP-hard.
3. Large loops make the problem difficult in practice.
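Written out (my own formulation, tying this back to the RecMII bound from the modulo scheduling lectures), the steady-state rate of the pipeline is set by its slowest stage, and no stage can beat its worst recurrence:

\[
\text{Rate} \;=\; \frac{1}{\max_i t_i},
\qquad
t_i \;\ge\; \max_{\text{cycle } c \,\in\, \text{stage } i}
  \left\lceil \frac{\mathrm{delay}(c)}{\mathrm{distance}(c)} \right\rceil
  \;=\; \mathrm{RecMII}_i .
\]

So the partitioner wants stages whose recurrences are short and roughly balanced; since finding the longest recurrence exactly is NP-hard, heuristics and the speculation from the previous slides both matter.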
- 26 - Selecting Dependences to Speculate

  A: while(cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: after speculation, the DAG SCC nodes are assigned one per thread: D to Thread 1, B to Thread 2, C to Thread 3, A to Thread 4; edge legend: intra-iteration, loop-carried, register, control; communication queue]
- 27 - Detecting Misspeculation
[Figure: DAG SCC with D, B, C, A as pipeline stages]

Thread 1:
  A1: while(consume(4))
  D :   node = node->next
        produce({0,1}, node);

Thread 2:
  A2: while(consume(5))
  B :   ncost = doit(node);
        produce(2, ncost);
  D2:   node = consume(0);

Thread 3:
  A3: while(consume(6))
  B3:   ncost = consume(2);
  C :   cost += ncost;
        produce(3, cost);

Thread 4:
  A : while(cost < T && node)
  B4:   cost = consume(3);
  C4:   node = consume(1);
        produce({4,5,6}, cost < T && node);
- 28 - Detecting Misspeculation
[Figure: DAG SCC with D, B, C, A as pipeline stages]

Thread 1:
  A1: while(TRUE)
  D :   node = node->next
        produce({0,1}, node);

Thread 2:
  A2: while(TRUE)
  B :   ncost = doit(node);
        produce(2, ncost);
  D2:   node = consume(0);

Thread 3:
  A3: while(TRUE)
  B3:   ncost = consume(2);
  C :   cost += ncost;
        produce(3, cost);

Thread 4:
  A : while(cost < T && node)
  B4:   cost = consume(3);
  C4:   node = consume(1);
        produce({4,5,6}, cost < T && node);
- 29 - Detecting Misspeculation
[Figure: DAG SCC with D, B, C, A as pipeline stages]

Thread 1:
  A1: while(TRUE)
  D :   node = node->next
        produce({0,1}, node);

Thread 2:
  A2: while(TRUE)
  B :   ncost = doit(node);
        produce(2, ncost);
  D2:   node = consume(0);

Thread 3:
  A3: while(TRUE)
  B3:   ncost = consume(2);
  C :   cost += ncost;
        produce(3, cost);

Thread 4:
  A : while(cost < T && node)
  B4:   cost = consume(3);
  C4:   node = consume(1);
        if(!(cost < T && node)) FLAG_MISSPEC();
- 30 - Breaking False Memory Dependences
[Figure: dependence graph over D, B, C, A with a false memory dependence; speculative state is kept in memory versions (e.g., Memory Version 3), and the oldest version is committed by the recovery thread; edge legend: intra-iteration, loop-carried, register, control, communication queue, false memory]
- 31 - Adding Parallel Stages to DSWP

  while(ptr = ptr->next)     // LD
    ptr->val = ptr->val + 1; // X

Assumptions: LD = 1 cycle, X = 2 cycles, comm. latency = 2 cycles
Throughput:
  DSWP:     1/2 iteration/cycle
  DOACROSS: 1/2 iteration/cycle
  PS-DSWP:  1 iteration/cycle
[Figure: three-core schedule; Core 1 runs LD:1..LD:8 while X is replicated, with X:1, X:3, X:5 on Core 2 and X:2, X:4 on Core 3, giving one iteration per cycle]
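One way to arrive at those numbers (my own back-of-the-envelope accounting of the slide's assumptions, not text from the paper): DSWP is limited by its slowest stage, X, at 2 cycles per iteration; DOACROSS carries the ptr recurrence across cores, so each iteration pays the 2-cycle communication latency; PS-DSWP replicates the X stage on two cores, halving its effective per-iteration cost:

\[
\text{DSWP: } \frac{1}{\max(1,\,2)} = \tfrac12, \qquad
\text{DOACROSS: } \frac{1}{\mathrm{lat(comm)}} = \tfrac12, \qquad
\text{PS-DSWP: } \frac{1}{\max\!\left(1,\,\tfrac{2}{2}\right)} = 1 \ \text{iter/cycle}.
\]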
- 32 - Thread Partitioning

  p = list; sum = 0;
  A: while (p != NULL) {
  B:   id = p->id;
  E:   q = p->inner_list;
  C:   if (!visited[id]) {
  D:     visited[id] = true;
  F:     while (foo(q))
  G:       q = q->next;
  H:     if (q != NULL)
  I:       sum += p->value;   // reduction
       }
  J:   p = p->next;
     }

[Figure: dependence graph of the loop annotated with profile weights 10, 5, 5, 50, 5, 3; the accumulation in I is marked as a reduction]
- 33 - Thread Partitioning: DAG SCC
- 34 - Thread Partitioning
Merging invariants:
  » No cycles
  » No loop-carried dependence inside a doall node
[Figure: DAG SCC nodes with profile weights 20, 10, 15, 5, 100, 5, 3 being merged]
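As a concrete reading of those two invariants (a minimal sketch of my own, not the partitioner from the paper), a merge candidate can be checked roughly as below; the node count, adjacency matrix, and loop_carried flags are assumed to come from the dependence analysis.

  #include <stdbool.h>
  #include <string.h>

  #define MAXN 64
  static int n;                        /* number of DAG SCC nodes            */
  static int adj[MAXN][MAXN];          /* adj[u][v] = 1 if dependence u -> v */
  static bool loop_carried[MAXN];      /* node contains a loop-carried dep   */

  /* Is dst reachable from src by a path of length >= 1?  (simple DFS) */
  static bool reachable(int src, int dst, bool seen[]) {
      if (seen[src]) return false;
      seen[src] = true;
      for (int v = 0; v < n; v++)
          if (adj[src][v] && (v == dst || reachable(v, dst, seen)))
              return true;
      return false;
  }

  /* A path a ->* b that leaves a through some node other than b would turn
     into a cycle once a and b are merged; a lone direct edge a -> b is fine. */
  static bool path_through_other(int a, int b) {
      for (int x = 0; x < n; x++) {
          if (x == b || !adj[a][x]) continue;
          bool seen[MAXN];
          memset(seen, 0, sizeof seen);
          if (reachable(x, b, seen))
              return true;
      }
      return false;
  }

  bool can_merge(int a, int b, bool merged_node_is_doall) {
      if (path_through_other(a, b) || path_through_other(b, a))
          return false;   /* invariant 1: merging must not create a cycle     */
      if (merged_node_is_doall && (loop_carried[a] || loop_carried[b]))
          return false;   /* invariant 2: no loop-carried dep in a doall node */
      return true;
  }

The cycle test is the reason a direct edge between two nodes is harmless while an indirect path is not: once a and b are one node, any a-to-b path through a third node closes a loop in the partition DAG.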
- 35 - Thread Partitioning
[Figure: merged DAG SCC with node weights 20, 10, 15, and 113; label: treated as sequential]
- 36 - Thread Partitioning
[Figure: final partition with stage weights 45 and 113]
v Modified MTCG [Ottoni, MICRO '05] to generate code from the partition
- 37 - Discussion Point 1 – Speculation
v How do you decide which dependences to speculate?
  » Look solely at profile data?
  » What about code structure?
v How do you manage speculation in a pipeline?
  » The traditional definition of a transaction is broken
  » Transaction execution is spread out across multiple cores
- 38 - Discussion Point 2 – Pipeline Structure
v When is a pipeline a good/bad choice for parallelization?
v Is pipelining good or bad for cache performance?
  » Is DOALL better/worse for cache?
v Can a pipeline be adjusted when the number of available cores increases/decreases?