EECS 583 – Class 17
Research Topic 1: Decoupled Software Pipelining
University of Michigan, November 9, 2011

Announcements + Reading Material
- 2nd paper review due today
  - Should have submitted to andrew.eecs.umich.edu:/y/submit
- Next Monday – Midterm exam in class
- Today's class reading
  - “Automatic Thread Extraction with Decoupled Software Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005
- Next class reading (Wednesday, Nov. 16)
  - “Spice: Speculative Parallel Iteration Chunk Execution,” E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc. 2008 Intl. Symposium on Code Generation and Optimization, April 2008

Midterm Exam
- When: Monday, Nov 14, 2011, 10:40-12:30
- Where
  - 1005 EECS – uniquenames starting with A-H go here
  - 3150 Dow (our classroom) – uniquenames starting with I-Z go here
- What to expect
  - Open book/notes, no laptops
  - Apply techniques we discussed in class on examples
  - Reason about solving compiler problems – why things are done
  - A couple of thinking problems
  - No LLVM code
  - Reasonably long, but you should finish
- The last 2 years' exams are posted on the course website
  - Note – past exams may not accurately predict future exams!!

Midterm Exam
- Office hours between now and Monday if you have questions
  - Daya: Thursday and Friday, 3-5pm
  - Scott: Wednesday 4:30-5:30, Friday 4:30-5:30
- Studying
  - Yes, you should study even though it's open notes
    - Lots of material that you have likely forgotten
    - Refresh your memories
    - No memorization required, but you need to be familiar with the material to finish the exam
  - Go through lecture notes, especially the examples!
  - If you are confused on a topic, go through the reading
  - If still confused, come talk to me or Daya
  - Go through the practice exams as the final step

Exam Topics
- Control flow analysis
  - Control flow graphs, Dom/pdom, loop detection
  - Trace selection, superblocks
- Predicated execution
  - Control dependence analysis, if-conversion, hyperblocks
  - Can ignore control height reduction
- Dataflow analysis
  - Liveness, reaching defs, DU/UD chains, available defs/exprs
  - Static single assignment
- Optimizations
  - Classical: dead code elim, constant/copy prop, CSE, LICM, induction variable strength reduction
  - ILP optimizations: unrolling, renaming, tree height reduction, induction/accumulator expansion
  - Speculative optimization – like HW 1

Exam Topics - Continued
- Acyclic scheduling
  - Dependence graphs, Estart/Lstart/Slack, list scheduling
  - Code motion across branches, speculation, exceptions
- Software pipelining
  - DSA form, ResMII, RecMII, modulo scheduling
  - Make sure you can modulo schedule a loop!
  - Execution control with LC, ESC
- Register allocation
  - Live ranges, graph coloring
- Research topics
  - Can ignore these

Last Class
- Scientific codes – successful parallelization
  - KAP, SUIF, Parascope, gcc w/ Graphite
  - Affine array dependence analysis
  - DOALL parallelization
- C programs
  - Not dominated by array accesses – classic parallelization fails
  - Speculative parallelization – Hydra, Stampede, speculative multithreading
    - Profiling to identify statistical DOALL loops
    - But not all loops are DOALL, and outer loops typically are not!
- This class – parallelizing loops with dependences

What About Non-Scientific Codes???

Scientific codes (FORTRAN-like) – Independent Multithreading (IMT), example: DOALL parallelization:

  for (i = 1; i <= N; i++)     // C
    a[i] = a[i] + 1;           // X

General-purpose codes (legacy C/C++) – Cyclic Multithreading (CMT), example: DOACROSS [Cytron, ICPP 86]:

  while (ptr = ptr->next)      // LD
    ptr->val = ptr->val + 1;   // X

[Figure: two-core timelines – DOALL runs iterations (C:i, X:i) independently on Core 1 and Core 2; DOACROSS alternates dependent iterations (LD:i, X:i) between the two cores.]
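The difference is the loop-carried dependence. Below is a minimal sketch (my own illustration, not from the slides) contrasting the two loops; the OpenMP pragma is an assumed, illustrative way to express the DOALL case, and the file compiles as plain C when OpenMP is not enabled.

  /* Sketch: the array loop is DOALL because iterations touch disjoint a[i];
   * the pointer-chasing loop is not, because iteration i+1's ptr is produced
   * by iteration i (the LD statement), so iterations cannot be distributed
   * independently and DOACROSS must forward ptr between cores instead. */
  #include <stddef.h>

  typedef struct node { int val; struct node *next; } node_t;

  void doall_ok(int *a, int n) {
      #pragma omp parallel for            /* independent iterations: IMT / DOALL */
      for (int i = 1; i <= n; i++)
          a[i] = a[i] + 1;
  }

  void not_doall(node_t *ptr) {
      while ((ptr = ptr->next) != NULL)   /* loop-carried dependence on ptr */
          ptr->val = ptr->val + 1;
  }

  int main(void) {
      int a[8] = {0};
      node_t n2 = {5, NULL}, n1 = {3, &n2}, n0 = {1, &n1};
      doall_ok(a, 7);
      not_doall(&n0);                     /* n1.val becomes 4, n2.val becomes 6 */
      return 0;
  }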

Alternative Parallelization Approaches

  while (ptr = ptr->next)      // LD
    ptr->val = ptr->val + 1;   // X

Cyclic Multithreading (CMT) vs. Pipelined Multithreading (PMT); PMT example: DSWP [PACT 2004].

[Figure: two-core timelines – CMT alternates whole iterations (LD:i, X:i) between Core 1 and Core 2, while PMT keeps every LD:i on Core 1 and streams values to the X:i stage on Core 2.]

Comparison: IMT, PMT, CMT

[Figure: side-by-side two-core timelines – IMT runs independent iterations (C:i, X:i) on each core, CMT alternates dependent iterations between the cores, and PMT runs the LD stage on Core 1 feeding the X stage on Core 2.]

Comparison: IMT, PMT, CMT

With lat(comm) = 1, all three schemes achieve 1 iter/cycle.

[Figure: the same three two-core timelines, each annotated with its iteration rate.]

Comparison: IMT, PMT, CMT

With lat(comm) = 2, IMT and PMT still achieve 1 iter/cycle, but CMT drops to 0.5 iter/cycle because the cross-core dependence sits on the critical path of every iteration.

[Figure: the three timelines with the communication latency doubled.]

Comparison: IMT, PMT, CMT

- Handling cross-thread dependences → wide applicability (CMT and PMT apply to loops where IMT/DOALL cannot).
- Keeping recurrences thread-local → fast execution (PMT tolerates communication latency; CMT does not).

[Figure: the three two-core timelines repeated for reference.]

Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP

Example: 197.parser decomposes into a pipeline – Find English Sentences → Parse Sentences (95%) → Emit Results. Decoupled Software Pipelining extracts the pipeline; PS-DSWP additionally runs the middle stage as a speculative DOALL stage.

Decoupled Software Pipelining

Decoupled Software Pipelining (DSWP) [MICRO 2005]

  A: while (node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

Inter-thread communication latency is a one-time cost.

[Figure: the dependence graph of A-D (intra-iteration and loop-carried register/control edges), its DAG of SCCs, and the resulting partition – Thread 1 gets the {A, D} traversal recurrence, Thread 2 gets B and C, connected by a communication queue.]
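To make the transformation concrete, here is a minimal, self-contained sketch of the two-thread partition above (my own illustration, not the MTCG-generated code from the paper). The one-slot blocking queue, the stand-in doit() body, and the NULL end-of-stream sentinel are all assumptions.

  /* Hedged sketch of the DSWP partition: Thread 1 owns the {A, D} traversal
   * recurrence and streams nodes through a one-slot blocking queue; Thread 2
   * consumes them and runs B and C. */
  #include <pthread.h>
  #include <stdio.h>
  #include <stddef.h>

  typedef struct node { int val; struct node *next; } node_t;

  typedef struct {                       /* one-slot blocking queue          */
      pthread_mutex_t lock;
      pthread_cond_t  cond;
      node_t *slot;
      int     full;
  } queue_t;

  static void produce(queue_t *q, node_t *p) {
      pthread_mutex_lock(&q->lock);
      while (q->full) pthread_cond_wait(&q->cond, &q->lock);
      q->slot = p; q->full = 1;
      pthread_cond_signal(&q->cond);
      pthread_mutex_unlock(&q->lock);
  }

  static node_t *consume(queue_t *q) {
      pthread_mutex_lock(&q->lock);
      while (!q->full) pthread_cond_wait(&q->cond, &q->lock);
      node_t *p = q->slot; q->full = 0;
      pthread_cond_signal(&q->cond);
      pthread_mutex_unlock(&q->lock);
      return p;
  }

  static int doit(node_t *n) { return n->val; }   /* stand-in for real work  */

  typedef struct { node_t *head; queue_t *q; } stage1_arg_t;

  /* Thread 1: statements A and D (loop control + pointer chasing). */
  static void *stage1(void *argp) {
      stage1_arg_t *arg = argp;
      node_t *node = arg->head;
      while (node) {                     /* A                                */
          produce(arg->q, node);         /* forward node to stage 2          */
          node = node->next;             /* D                                */
      }
      produce(arg->q, NULL);             /* end-of-stream sentinel           */
      return NULL;
  }

  /* Thread 2: statements B and C (the off-recurrence work). */
  static void *stage2(void *argp) {
      queue_t *q = argp;
      long cost = 0;
      node_t *node;
      while ((node = consume(q)) != NULL) {
          int ncost = doit(node);        /* B                                */
          cost += ncost;                 /* C                                */
      }
      printf("cost = %ld\n", cost);
      return NULL;
  }

  int main(void) {
      node_t nodes[4] = { {1, &nodes[1]}, {2, &nodes[2]}, {3, &nodes[3]}, {4, NULL} };
      queue_t q = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, 0 };
      stage1_arg_t arg = { nodes, &q };
      pthread_t t1, t2;
      pthread_create(&t1, NULL, stage1, &arg);
      pthread_create(&t2, NULL, stage2, &q);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;
  }

Because the node recurrence never crosses the queue, Thread 1 can run arbitrarily far ahead of Thread 2; that decoupling is why inter-thread latency is only a one-time (pipeline-fill) cost.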

Implementing DSWP

[Figure: the loop L1, the auxiliary code Aux, and the DFG; edge legend: intra-iteration / loop-carried, register / memory / control dependences.]

Optimization: Node Splitting To Eliminate Cross-Thread Control

[Figure: dependence graphs L1 and L2 before and after node splitting; edge legend: intra-iteration / loop-carried, register / memory / control dependences.]

Optimization: Node Splitting To Reduce Communication

[Figure: dependence graphs L1 and L2 before and after node splitting; edge legend: intra-iteration / loop-carried, register / memory / control dependences.]

Constraint: Strongly Connected Components

Consider a dependence cycle split across threads: values would have to flow back and forth between the threads, which eliminates the pipelined/decoupled property. Solution: collapse each strongly connected component into a single node and partition the resulting DAG (DAG_SCC).

[Figure: an example dependence graph illustrating the problem; edge legend: intra-iteration / loop-carried, register / memory / control dependences.]
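For reference, a minimal sketch (not the compiler's actual implementation) of finding the SCCs of the A-D dependence graph from the DSWP example using Tarjan's algorithm; the adjacency matrix below is my reconstruction of that graph's edges.

  #include <stdio.h>

  #define N 4                /* nodes A=0, B=1, C=2, D=3 from the example loop */

  /* adjacency matrix of the dependence graph (edge u -> v), reconstructed */
  static const int dep[N][N] = {
      /* A  B  C  D */
      {  0, 1, 1, 1 },   /* A: loop control over B, C, D              */
      {  0, 0, 1, 0 },   /* B -> C: ncost                              */
      {  0, 0, 1, 0 },   /* C -> C: cost (loop-carried)                */
      {  1, 1, 0, 1 },   /* D -> A, D -> B: node; D -> D (loop-carried) */
  };

  static int idx[N], low[N], oncur[N], comp[N], stack[N];
  static int counter = 0, sp = 0, ncomp = 0;

  static int min(int a, int b) { return a < b ? a : b; }

  static void strongconnect(int v) {
      idx[v] = low[v] = ++counter;
      stack[sp++] = v; oncur[v] = 1;
      for (int w = 0; w < N; w++) {
          if (!dep[v][w]) continue;
          if (idx[w] == 0) { strongconnect(w); low[v] = min(low[v], low[w]); }
          else if (oncur[w]) low[v] = min(low[v], idx[w]);
      }
      if (low[v] == idx[v]) {            /* v is the root of an SCC          */
          int w;
          do { w = stack[--sp]; oncur[w] = 0; comp[w] = ncomp; } while (w != v);
          ncomp++;
      }
  }

  int main(void) {
      for (int v = 0; v < N; v++)
          if (idx[v] == 0) strongconnect(v);
      for (int v = 0; v < N; v++)
          printf("%c in SCC %d\n", 'A' + v, comp[v]);
      return 0;
  }

Running it reports B and C as singleton SCCs while A and D share one, matching the DAG_SCC used in the DSWP partition.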

Extensions to the Basic Transformation
- Speculation
  - Break statistically unlikely dependences
  - Form better-balanced pipelines
- Parallel Stages
  - Execute multiple copies of certain “large” stages
  - Stages that contain inner loops are perfect candidates

Why Speculation?

  A: while (node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: the dependence graph and DAG of SCCs for A-D; edge legend: intra-iteration / loop-carried register and control dependences, communication queue.]

Why Speculation?

  A: while (cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

With the early exit on cost, every statement now feeds the loop condition, so A, B, C, and D collapse into a single SCC. But these exit dependences are predictable.

[Figure: the dependence graph and DAG of SCCs, with the predictable dependences highlighted.]

Why Speculation?

  A: while (cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

Speculating the predictable dependences breaks the single large SCC apart, restoring a pipelineable DAG of SCCs (A separated from B, C, and D).

[Figure: the dependence graph and DAG of SCCs after the predictable dependences are removed.]

Execution Paradigm

When misspeculation is detected, recovery reruns the offending iteration (iteration 4 in the example).

[Figure: DAG of SCCs (D, B, C, A) and a pipelined execution timeline annotated with "Misspeculation detected" and "Misspeculation Recovery – Rerun Iteration 4".]

Understanding PMT Performance

The pipeline runs at the rate of its slowest thread: a slowest stage of 1 cycle/iter gives an iteration rate of 1 iter/cycle; a slowest stage of 2 cycles/iter gives 0.5 iter/cycle, with idle time on the faster core.

1. The per-iteration time t_i of a stage is at least as large as the longest dependence recurrence placed in it.
2. It is NP-hard to find the longest recurrence.
3. Large loops make the problem difficult in practice.

[Figure: two-core timelines for a balanced pipeline (A and B stages, no idle time) and an unbalanced one (idle time on Core 2).]
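A compact restatement of the bound above (the notation is mine, not the slides'):

  % Pipeline (PMT) throughput is set by the slowest stage; each stage's
  % per-iteration time t_i is bounded below by its longest recurrence rec(s).
  \[
    \mathrm{IterationRate} = \frac{1}{\max_i t_i},
    \qquad
    t_i \;\ge\; \max_{s \in \mathrm{SCCs}(i)} \mathrm{rec}(s)
  \]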

Selecting Dependences To Speculate

  A: while (cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;

[Figure: the dependence graph and, after speculation, a DAG of SCCs split into four stages – D, B, C, and A – assigned to Threads 1-4.]

Detecting Misspeculation

  Thread 1:
    A1: while (consume(4))
    D :   node = node->next;
          produce({0,1}, node);

  Thread 2:
    A2: while (consume(5))
    B :   ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);

  Thread 3:
    A3: while (consume(6))
    B3:   ncost = consume(2);
    C :   cost += ncost;
          produce(3, cost);

  Thread 4:
    A : while (cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          produce({4,5,6}, cost < T && node);

[Figure: DAG of SCCs with stages D, B, C, and A mapped to Threads 1-4.]

Detecting Misspeculation

Speculating the loop exit: Threads 1-3 no longer consume the exit condition; they run ahead with while (TRUE).

  Thread 1:
    A1: while (TRUE)
    D :   node = node->next;
          produce({0,1}, node);

  Thread 2:
    A2: while (TRUE)
    B :   ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);

  Thread 3:
    A3: while (TRUE)
    B3:   ncost = consume(2);
    C :   cost += ncost;
          produce(3, cost);

  Thread 4:
    A : while (cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          produce({4,5,6}, cost < T && node);

Detecting Misspeculation

Thread 4 still evaluates the real exit condition and flags misspeculation when the speculated while (TRUE) was wrong.

  Thread 1:
    A1: while (TRUE)
    D :   node = node->next;
          produce({0,1}, node);

  Thread 2:
    A2: while (TRUE)
    B :   ncost = doit(node);
          produce(2, ncost);
    D2:   node = consume(0);

  Thread 3:
    A3: while (TRUE)
    B3:   ncost = consume(2);
    C :   cost += ncost;
          produce(3, cost);

  Thread 4:
    A : while (cost < T && node)
    B4:   cost = consume(3);
    C4:   node = consume(1);
          if (!(cost < T && node)) FLAG_MISSPEC();

Breaking False Memory Dependences

[Figure: the dependence graph of A-D with false memory dependences shown alongside the register/control edges; each in-flight iteration runs against its own memory version (e.g., Memory Version 3), and the oldest version is committed by the recovery thread.]

Adding Parallel Stages to DSWP

  while (ptr = ptr->next)      // LD: 1 cycle
    ptr->val = ptr->val + 1;   // X:  2 cycles

Throughput with communication latency = 2 cycles:
- DSWP: 1/2 iteration/cycle
- DOACROSS: 1/2 iteration/cycle
- PS-DSWP: 1 iteration/cycle

[Figure: three-core timeline – Core 1 runs the LD stage; Cores 2 and 3 run replicated copies of the X stage, alternating iterations (X:1, X:3, X:5 on one core; X:2, X:4 on the other).]
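A single-threaded sketch (my own illustration, under stated assumptions) of the PS-DSWP work distribution: the sequential LD stage hands iterations round-robin to replicated copies of the X stage. Real PS-DSWP runs each worker on its own core behind a blocking queue, as in the earlier DSWP sketch; plain arrays stand in for the queues here, and the X stage is eligible for replication only because it carries no cross-iteration dependence.

  #include <stdio.h>
  #include <stddef.h>

  #define NWORKERS 2
  #define MAXQ     64

  typedef struct node { int val; struct node *next; } node_t;

  int main(void) {
      node_t nodes[5] = { {10, &nodes[1]}, {20, &nodes[2]}, {30, &nodes[3]},
                          {40, &nodes[4]}, {50, NULL} };

      node_t *queue[NWORKERS][MAXQ];         /* stand-ins for inter-core FIFOs */
      int     count[NWORKERS] = {0};

      /* Sequential stage (LD): traverse the list, distribute round-robin.   */
      int iter = 0;
      for (node_t *ptr = nodes; (ptr = ptr->next) != NULL; iter++)
          queue[iter % NWORKERS][count[iter % NWORKERS]++] = ptr;

      /* Parallel stage (X), replicated: each worker drains its own queue.   */
      for (int w = 0; w < NWORKERS; w++)
          for (int i = 0; i < count[w]; i++)
              queue[w][i]->val = queue[w][i]->val + 1;

      for (int i = 0; i < 5; i++) printf("%d ", nodes[i].val);
      printf("\n");                          /* prints: 10 21 31 41 51       */
      return 0;
  }

Because the replicated X workers never need to talk to each other, only the LD recurrence remains sequential, which is why throughput doubles over plain DSWP in the slide's example.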

Thread Partitioning

     p = list; sum = 0;
  A: while (p != NULL) {
  B:   id = p->id;
  E:   q = p->inner_list;
  C:   if (!visited[id]) {
  D:     visited[id] = true;
  F:     while (foo(q))
  G:       q = q->next;
  H:     if (q != NULL)
  I:       sum += p->value;     // reduction
       }
  J:   p = p->next;
     }

Thread Partitioning: DAG SCC

Thread Partitioning

Merging invariants:
- No cycles
- No loop-carried dependence inside a DOALL node

Thread Partitioning

[Figure: the partitioned DAG of SCCs; the marked nodes are treated as sequential stages.]

Thread Partitioning
- Modified MTCG [Ottoni, MICRO '05] to generate code from the partition

Discussion Point 1 – Speculation
- How do you decide what dependences to speculate?
  - Look solely at profile data?
  - What about code structure?
- How do you manage speculation in a pipeline?
  - Traditional definition of a transaction is broken
  - Transaction execution spread out across multiple cores

Discussion Point 2 – Pipeline Structure
- When is a pipeline a good/bad choice for parallelization?
- Is pipelining good or bad for cache performance?
  - Is DOALL better/worse for cache?
- Can a pipeline be adjusted when the number of available cores increases/decreases?