EECS 583 – Class 17
Research Topic 1: Decoupled Software Pipelining
University of Michigan
November 9, 2011
- 1 - Announcements + Reading Material
v 2nd paper review due today
  » Should have submitted to andrew.eecs.umich.edu:/y/submit
v Next Monday: midterm exam in class
v Today's class reading
  » "Automatic Thread Extraction with Decoupled Software Pipelining," G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005.
v Next class reading (Wednesday, Nov 16)
  » "Spice: Speculative Parallel Iteration Chunk Execution," E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc. 2008 Intl. Symposium on Code Generation and Optimization, April 2008.
- 2 - Midterm Exam
v When: Monday, Nov 14, 2011, 10:40-12:30
v Where
  » 1005 EECS: uniquenames starting with A-H go here
  » 3150 Dow (our classroom): uniquenames starting with I-Z go here
v What to expect
  » Open book/notes, no laptops
  » Apply techniques we discussed in class on examples
  » Reason about solving compiler problems, i.e., why things are done
  » A couple of thinking problems
  » No LLVM code
  » Reasonably long, but you should finish
v The last 2 years' exams are posted on the course website
  » Note: past exams may not accurately predict future exams!!
- 3 - Midterm Exam
v Office hours between now and Monday if you have questions
  » Daya: Thurs and Fri, 3-5pm
  » Scott: Wednesday 4:30-5:30, Fri 4:30-5:30
v Studying
  » Yes, you should study even though it's open notes
    Lots of material that you have likely forgotten
    Refresh your memories
    No memorization required, but you need to be familiar with the material to finish the exam
  » Go through the lecture notes, especially the examples!
  » If you are confused on a topic, go through the reading
  » If still confused, come talk to me or Daya
  » Go through the practice exams as the final step
- 4 - Exam Topics
v Control flow analysis
  » Control flow graphs, dom/pdom, loop detection
  » Trace selection, superblocks
v Predicated execution
  » Control dependence analysis, if-conversion, hyperblocks
  » Can ignore control height reduction
v Dataflow analysis
  » Liveness, reaching defs, DU/UD chains, available defs/exprs
  » Static single assignment
v Optimizations
  » Classical: dead code elim, constant/copy prop, CSE, LICM, induction variable strength reduction
  » ILP optimizations: unrolling, renaming, tree height reduction, induction/accumulator expansion
  » Speculative optimization, like HW 1
- 5 - Exam Topics, Continued
v Acyclic scheduling
  » Dependence graphs, Estart/Lstart/Slack, list scheduling
  » Code motion across branches, speculation, exceptions
v Software pipelining
  » DSA form, ResMII, RecMII, modulo scheduling
  » Make sure you can modulo schedule a loop!
  » Execution control with LC, ESC
v Register allocation
  » Live ranges, graph coloring
v Research topics
  » Can ignore these
- 6 - Last Class
v Scientific codes: successful parallelization
  » KAP, SUIF, Parascope, gcc w/ Graphite
  » Affine array dependence analysis
  » DOALL parallelization
v C programs
  » Not dominated by array accesses, so classic parallelization fails
  » Speculative parallelization: Hydra, Stampede, speculative multithreading
    Profiling to identify statistical DOALL loops
    But not all loops are DOALL, and outer loops typically are not!!
v This class: parallelizing loops with dependences
- 7 - What About Non-Scientific Codes???

Scientific codes (FORTRAN-like):
  for(i=1; i<=N; i++)        // C
    a[i] = a[i] + 1;         // X
Independent Multithreading (IMT); example: DOALL parallelization
[Figure: two-core timeline; iterations alternate between Core 1 (C:1/X:1, C:3/X:3, C:5/X:5) and Core 2 (C:2/X:2, C:4/X:4, C:6/X:6) with no cross-core communication]

General-purpose codes (legacy C/C++):
  while(ptr = ptr->next)     // LD
    ptr->val = ptr->val + 1; // X
Cyclic Multithreading (CMT); example: DOACROSS [Cytron, ICPP 86]
[Figure: two-core timeline; iterations alternate between Core 1 (LD:1/X:1, LD:3/X:3, LD:5/X:5) and Core 2 (LD:2/X:2, LD:4/X:4, LD:6)]
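To make the IMT/DOALL idea concrete, here is a minimal sketch (my own illustration, not from the slides or the paper) of how the array loop could be split across cores by giving each core every other iteration. The worker signature and the surrounding thread setup are assumptions for the sketch.

  /* Sketch only: a cyclic (round-robin) DOALL split of the array loop.
     Iterations are independent, so each core runs its share with no
     cross-core communication.  `a`, `N`, and the thread setup are assumed. */
  void doall_worker(int core_id, int num_cores, int *a, int N) {
      for (int i = 1 + core_id; i <= N; i += num_cores)   /* C */
          a[i] = a[i] + 1;                                 /* X */
  }
  /* The list loop cannot be split this way: each iteration's LD needs the
     result of the previous iteration's LD, so the same split would have to
     forward `ptr` between cores every iteration, which is exactly what
     DOACROSS/CMT does. */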
- 8 - Alternative Parallelization Approaches

  while(ptr = ptr->next)     // LD
    ptr->val = ptr->val + 1; // X

Cyclic Multithreading (CMT)
[Figure: two-core timeline; iterations alternate between cores, so each LD waits on the LD of the previous iteration from the other core]

Pipelined Multithreading (PMT); example: DSWP [PACT 2004]
[Figure: two-core timeline; Core 1 runs all LDs (LD:1, LD:2, ...) while Core 2 runs all Xs (X:1, X:2, ...), one iteration behind]
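A minimal PMT-style sketch of this loop (my own illustration, not the compiler's generated code): the LD stage walks the list on one thread and forwards each node through a small FIFO to the X stage on another thread. The produce/consume names, the bounded pthread queue, and the NULL end-of-list marker are stand-ins for the inter-core communication queues on the slides.

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct node { int val; struct node *next; } node;

  #define QSIZE 8
  static node *fifo[QSIZE];
  static int qhead, qtail, qcount;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
  static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

  static void produce(node *p) {                 /* blocking enqueue */
      pthread_mutex_lock(&lock);
      while (qcount == QSIZE) pthread_cond_wait(&not_full, &lock);
      fifo[qtail] = p; qtail = (qtail + 1) % QSIZE; qcount++;
      pthread_cond_signal(&not_empty);
      pthread_mutex_unlock(&lock);
  }

  static node *consume(void) {                   /* blocking dequeue */
      pthread_mutex_lock(&lock);
      while (qcount == 0) pthread_cond_wait(&not_empty, &lock);
      node *p = fifo[qhead]; qhead = (qhead + 1) % QSIZE; qcount--;
      pthread_cond_signal(&not_full);
      pthread_mutex_unlock(&lock);
      return p;
  }

  static void *ld_stage(void *arg) {             /* Thread 1: LD */
      node *ptr = arg;
      while ((ptr = ptr->next))
          produce(ptr);
      produce(NULL);                             /* end-of-loop marker */
      return NULL;
  }

  static void *x_stage(void *arg) {              /* Thread 2: X */
      (void)arg;
      node *ptr;
      while ((ptr = consume()))
          ptr->val = ptr->val + 1;
      return NULL;
  }

  int main(void) {
      /* Build a small list behind a dummy head, mirroring `while(ptr = ptr->next)`. */
      node head_node = { 0, NULL }, *tailp = &head_node;
      for (int i = 0; i < 10; i++) {
          node *n = malloc(sizeof *n);
          n->val = i; n->next = NULL;
          tailp->next = n; tailp = n;
      }
      pthread_t t1, t2;
      pthread_create(&t1, NULL, ld_stage, &head_node);
      pthread_create(&t2, NULL, x_stage, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      for (node *p = head_node.next; p; p = p->next) printf("%d ", p->val);
      printf("\n");
      return 0;
  }

The point of the pipeline shape is that inter-thread communication latency only delays the start of the consumer stage; it is not paid on every iteration's critical path.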
- 9 - Comparison: IMT, PMT, CMT
[Figure: the three two-core schedules side by side; IMT alternates whole iterations (C:i/X:i) across cores, CMT alternates LD:i/X:i across cores, and PMT runs all LDs on Core 1 and all Xs on Core 2]
- 10 - Comparison: IMT, PMT, CMT
lat(comm) = 1: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 1 iter/cycle
[Figure: with single-cycle communication latency, all three schedules complete one iteration per cycle]
- 11 - Comparison: IMT, PMT, CMT
lat(comm) = 1: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 1 iter/cycle
lat(comm) = 2: IMT 1 iter/cycle, PMT 1 iter/cycle, CMT 0.5 iter/cycle
[Figure: with two-cycle communication latency, the CMT schedule stalls between iterations while IMT and PMT still complete one iteration per cycle]
- 12 - Comparison: IMT, PMT, CMT
[Figure: the same three two-core schedules, annotated:]
  Cross-thread dependences: wide applicability
  Thread-local recurrences: fast execution
- 13 - Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP
Example: 197.parser
[Figure: Decoupled Software Pipelining stages: Find English Sentences, Parse Sentences (95%), Emit Results; the Parse Sentences stage becomes a PS-DSWP (speculative DOALL) middle stage]
Decoupled Software Pipelining
- 15 - Decoupled Software Pipelining (DSWP)

  A: while(node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

Inter-thread communication latency is a one-time cost.
[MICRO 2005]
[Figure: dependence graph over A, B, C, D and its DAG SCC; the {A, D} recurrence goes to Thread 1 and {B}, {C} go to Thread 2; edge legend: intra-iteration, loop-carried, register, control; inter-thread values flow through a communication queue]
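The DAG SCC in the figure is the condensation of the dependence graph: each strongly connected component becomes one node, and the result is acyclic, so it can be cut into pipeline stages. A minimal sketch of that step (my own illustration, not the compiler's implementation), using Tarjan's SCC algorithm on the dependence edges as I read them off this slide:

  /* Sketch only: build the DAG of strongly connected components (the
     slides' "DAG SCC") for the A/B/C/D loop above using Tarjan's algorithm. */
  #include <stdio.h>

  #define MAXN 64
  static int n;                      /* number of dependence-graph nodes   */
  static int adj[MAXN][MAXN];        /* adj[u][v] = 1 if dependence u -> v */
  static int idx[MAXN], low[MAXN], oncur[MAXN], comp[MAXN];
  static int stk[MAXN], sp, timer, ncomp;

  static void tarjan(int u) {
      idx[u] = low[u] = ++timer;
      stk[sp++] = u; oncur[u] = 1;
      for (int v = 0; v < n; v++) {
          if (!adj[u][v]) continue;
          if (!idx[v]) { tarjan(v); if (low[v] < low[u]) low[u] = low[v]; }
          else if (oncur[v] && idx[v] < low[u]) low[u] = idx[v];
      }
      if (low[u] == idx[u]) {        /* u is the root of an SCC */
          int v;
          do { v = stk[--sp]; oncur[v] = 0; comp[v] = ncomp; } while (v != u);
          ncomp++;
      }
  }

  int main(void) {
      /* Dependence edges of the A/B/C/D loop as I read them off the figure:
         A->B, A->C, A->D (control), B->C (ncost), C->C and D->D (loop-carried),
         D->A and D->B (node). */
      n = 4;
      const char *name = "ABCD";
      int edges[][2] = {{0,1},{0,2},{0,3},{1,2},{2,2},{3,3},{3,0},{3,1}};
      for (size_t i = 0; i < sizeof edges / sizeof edges[0]; i++)
          adj[edges[i][0]][edges[i][1]] = 1;
      for (int u = 0; u < n; u++) if (!idx[u]) tarjan(u);
      for (int u = 0; u < n; u++)
          printf("%c -> SCC %d\n", name[u], comp[u]);
      /* Expected: A and D share a component (the pointer-chasing recurrence),
         while B and C each stand alone, matching the 3-node DAG SCC above. */
      return 0;
  }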
- 16 - Implementing DSWP
[Figure: code generation for DSWP showing the loop L1, the auxiliary code Aux, and the DFG; edge legend: intra-iteration, loop-carried, register, memory, control]
- 17 - Optimization: Node Splitting to Eliminate Cross-Thread Control
[Figure: loops L1 and L2 after node splitting; edge legend: intra-iteration, loop-carried, register, memory, control]
- 18 - Optimization: Node Splitting to Reduce Communication
[Figure: loops L1 and L2 after node splitting; edge legend: intra-iteration, loop-carried, register, memory, control]
- 19 - Constraint: Strongly Connected Components
Consider: [Figure: a dependence recurrence split across the proposed threads; edge legend: intra-iteration, loop-carried, register, memory, control]
Splitting a recurrence across threads eliminates the pipelined/decoupled property.
Solution: DAG SCC, i.e., partition at the granularity of strongly connected components.
- 20 - 2 Extensions to the Basic Transformation
v Speculation
  » Break statistically unlikely dependences
  » Form better-balanced pipelines
v Parallel Stages
  » Execute multiple copies of certain "large" stages
  » Stages that contain inner loops are perfect candidates
- 21 - Why Speculation?

  A: while(node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: dependence graph over A, B, C, D and its DAG SCC; edge legend: intra-iteration, loop-carried, register, control; communication queue]
- 22 - Why Speculation?

  A: while(cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: dependence graph and DAG SCC; with the early-exit condition, A, B, C, and D collapse into a single SCC, and the highlighted edges are the predictable dependences]
- 23 - Why Speculation?

  A: while(cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: speculating the predictable dependences breaks the large SCC apart, so D, B, C, and A each become their own node in the DAG SCC]
- 24 - Execution Paradigm
[Figure: DAG SCC nodes D, B, C, A running as pipeline stages; when misspeculation is detected on iteration 4, misspeculation recovery reruns iteration 4; edge legend: intra-iteration, loop-carried, register, control; communication queue]
- 25 - Understanding PMT Performance
[Figure: two two-core schedules; in the first, each stage takes 1 cycle per iteration (slowest thread: 1 cycle/iter, iteration rate 1 iter/cycle); in the second, the slowest thread needs 2 cycles per iteration, leaving idle time on the other core (2 cycle/iter, 0.5 iter/cycle)]
1. The per-iteration time of each thread is at least as long as its longest dependence recurrence.
2. Finding the longest recurrence is NP-hard.
3. Large loops make the problem difficult in practice.
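Written out (my own formulation, tying this back to the RecMII bound from the modulo scheduling lectures), the steady-state rate of the pipeline is set by its slowest stage, and no stage can beat its worst recurrence:

\[
\text{Rate} \;=\; \frac{1}{\max_i t_i},
\qquad
t_i \;\ge\; \max_{\text{cycle } c \,\in\, \text{stage } i}
  \left\lceil \frac{\mathrm{delay}(c)}{\mathrm{distance}(c)} \right\rceil
  \;=\; \mathrm{RecMII}_i .
\]

So the partitioner wants stages whose recurrences are short and roughly balanced; since finding the longest recurrence exactly is NP-hard, heuristics and the speculation from the previous slides both matter.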
- 26 - Selecting Dependences to Speculate

  A: while(cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: after speculation, the DAG SCC nodes are assigned one per thread: D to Thread 1, B to Thread 2, C to Thread 3, A to Thread 4; edge legend: intra-iteration, loop-carried, register, control; communication queue]
- 27 - Detecting Misspeculation
[Figure: DAG SCC with D, B, C, A as pipeline stages]

Thread 1:
  A1: while(consume(4))
  D :   node = node->next
        produce({0,1}, node);

Thread 2:
  A2: while(consume(5))
  B :   ncost = doit(node);
        produce(2, ncost);
  D2:   node = consume(0);

Thread 3:
  A3: while(consume(6))
  B3:   ncost = consume(2);
  C :   cost += ncost;
        produce(3, cost);

Thread 4:
  A : while(cost < T && node)
  B4:   cost = consume(3);
  C4:   node = consume(1);
        produce({4,5,6}, cost < T && node);
- 28 - Detecting Misspeculation
[Figure: DAG SCC with D, B, C, A as pipeline stages]

Thread 1:
  A1: while(TRUE)
  D :   node = node->next
        produce({0,1}, node);

Thread 2:
  A2: while(TRUE)
  B :   ncost = doit(node);
        produce(2, ncost);
  D2:   node = consume(0);

Thread 3:
  A3: while(TRUE)
  B3:   ncost = consume(2);
  C :   cost += ncost;
        produce(3, cost);

Thread 4:
  A : while(cost < T && node)
  B4:   cost = consume(3);
  C4:   node = consume(1);
        produce({4,5,6}, cost < T && node);
- 29 - Detecting Misspeculation
[Figure: DAG SCC with D, B, C, A as pipeline stages]

Thread 1:
  A1: while(TRUE)
  D :   node = node->next
        produce({0,1}, node);

Thread 2:
  A2: while(TRUE)
  B :   ncost = doit(node);
        produce(2, ncost);
  D2:   node = consume(0);

Thread 3:
  A3: while(TRUE)
  B3:   ncost = consume(2);
  C :   cost += ncost;
        produce(3, cost);

Thread 4:
  A : while(cost < T && node)
  B4:   cost = consume(3);
  C4:   node = consume(1);
        if(!(cost < T && node)) FLAG_MISSPEC();
- 30 - Breaking False Memory Dependences
[Figure: dependence graph over D, B, C, A with a false memory dependence; speculative state is kept in memory versions (e.g., Memory Version 3), and the oldest version is committed by the recovery thread; edge legend: intra-iteration, loop-carried, register, control, communication queue, false memory]
- 31 - Adding Parallel Stages to DSWP

  while(ptr = ptr->next)     // LD
    ptr->val = ptr->val + 1; // X

Assumptions: LD = 1 cycle, X = 2 cycles, comm. latency = 2 cycles
Throughput:
  DSWP:     1/2 iteration/cycle
  DOACROSS: 1/2 iteration/cycle
  PS-DSWP:  1 iteration/cycle
[Figure: three-core schedule; Core 1 runs LD:1..LD:8 while X is replicated, with X:1, X:3, X:5 on Core 2 and X:2, X:4 on Core 3, giving one iteration per cycle]
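One way to arrive at those numbers (my own back-of-the-envelope accounting of the slide's assumptions, not text from the paper): DSWP is limited by its slowest stage, X, at 2 cycles per iteration; DOACROSS carries the ptr recurrence across cores, so each iteration pays the 2-cycle communication latency; PS-DSWP replicates the X stage on two cores, halving its effective per-iteration cost:

\[
\text{DSWP: } \frac{1}{\max(1,\,2)} = \tfrac12, \qquad
\text{DOACROSS: } \frac{1}{\mathrm{lat(comm)}} = \tfrac12, \qquad
\text{PS-DSWP: } \frac{1}{\max\!\left(1,\,\tfrac{2}{2}\right)} = 1 \ \text{iter/cycle}.
\]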
- 32 - Thread Partitioning

  p = list; sum = 0;
  A: while (p != NULL) {
  B:   id = p->id;
  E:   q = p->inner_list;
  C:   if (!visited[id]) {
  D:     visited[id] = true;
  F:     while (foo(q))
  G:       q = q->next;
  H:     if (q != NULL)
  I:       sum += p->value;   // reduction
       }
  J:   p = p->next;
     }

[Figure: dependence graph of the loop annotated with profile weights 10, 5, 5, 50, 5, 3; the accumulation in I is marked as a reduction]
- 33 - Thread Partitioning: DAG SCC
- 34 - Thread Partitioning
Merging invariants:
  » No cycles
  » No loop-carried dependence inside a doall node
[Figure: DAG SCC nodes with profile weights 20, 10, 15, 5, 100, 5, 3 being merged]
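As a concrete reading of those two invariants (a minimal sketch of my own, not the partitioner from the paper), a merge candidate can be checked roughly as below; the node count, adjacency matrix, and loop_carried flags are assumed to come from the dependence analysis.

  #include <stdbool.h>
  #include <string.h>

  #define MAXN 64
  static int n;                        /* number of DAG SCC nodes            */
  static int adj[MAXN][MAXN];          /* adj[u][v] = 1 if dependence u -> v */
  static bool loop_carried[MAXN];      /* node contains a loop-carried dep   */

  /* Is dst reachable from src by a path of length >= 1?  (simple DFS) */
  static bool reachable(int src, int dst, bool seen[]) {
      if (seen[src]) return false;
      seen[src] = true;
      for (int v = 0; v < n; v++)
          if (adj[src][v] && (v == dst || reachable(v, dst, seen)))
              return true;
      return false;
  }

  /* A path a ->* b that leaves a through some node other than b would turn
     into a cycle once a and b are merged; a lone direct edge a -> b is fine. */
  static bool path_through_other(int a, int b) {
      for (int x = 0; x < n; x++) {
          if (x == b || !adj[a][x]) continue;
          bool seen[MAXN];
          memset(seen, 0, sizeof seen);
          if (reachable(x, b, seen))
              return true;
      }
      return false;
  }

  bool can_merge(int a, int b, bool merged_node_is_doall) {
      if (path_through_other(a, b) || path_through_other(b, a))
          return false;   /* invariant 1: merging must not create a cycle     */
      if (merged_node_is_doall && (loop_carried[a] || loop_carried[b]))
          return false;   /* invariant 2: no loop-carried dep in a doall node */
      return true;
  }

The cycle test is the reason a direct edge between two nodes is harmless while an indirect path is not: once a and b are one node, any a-to-b path through a third node closes a loop in the partition DAG.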
- 35 - Thread Partitioning
[Figure: merged DAG SCC with node weights 20, 10, 15, and 113; label: treated as sequential]
- 36 - Thread Partitioning
[Figure: final partition with stage weights 45 and 113]
v Modified MTCG [Ottoni, MICRO '05] to generate code from the partition
- 37 - Discussion Point 1 – Speculation
v How do you decide which dependences to speculate?
  » Look solely at profile data?
  » What about code structure?
v How do you manage speculation in a pipeline?
  » The traditional definition of a transaction is broken
  » Transaction execution is spread out across multiple cores
- 38 - Discussion Point 2 – Pipeline Structure
v When is a pipeline a good/bad choice for parallelization?
v Is pipelining good or bad for cache performance?
  » Is DOALL better/worse for cache?
v Can a pipeline be adjusted when the number of available cores increases/decreases?