Finding Limits of Parallelism using Dynamic Dependency Graphs – How much parallelism is out there?
Jonathan Mak & Alan Mycroft, University of Cambridge
WODA 2009, Chicago

Motivation
- Moore's Law, multi-core, and the end of the "free lunch"
- We need programs to be parallel
Source: Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3):16–20, March 2005.

Two approaches
- Explicit parallelism: specified by the programmer, e.g. OpenMP, Java threads, MPI, Cilk, TBB, the join calculus. Too hard for the average programmer?
- Implicit parallelism: extracted by the compiler, e.g. Polaris [Blume+ 94], dependence analysis [Kennedy 02], DSWP [Ottoni 05], GREMIO [Ottoni 07]

Implicit parallelism – what's the limit?
- Existing implementations are evaluated on a small number of cores/processors (<10)
- Speed-up rises with the number of processors, but how far can we go?
- Limits of instruction-level parallelism first explored by [Wall 93]
- Assumptions: no threading overheads; inter-thread communication is free; perfect alias analysis; perfect oracle for dependence analysis

Types of dependencies
- True dependencies (RAW):
    add $4, $5, $6
    sub $2, $3, $4    # reads the $4 written by the previous instruction
- Name dependencies:
  - False (anti) dependencies (WAR):
      add $4, $5, $6
      sub $6, $2, $3    # overwrites the $6 read by the previous instruction
  - Output dependencies (WAW):
      add $4, $5, $6
      sub $4, $2, $3    # both instructions write $4
- Control dependencies:
      beq $2, $3, L
      ...
    L: ...
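
To make the classification concrete, here is a small illustrative Python sketch (not from the original talk); the Instr model with one destination register and a set of source registers is a simplifying assumption:

from collections import namedtuple

# Hypothetical, simplified model of a decoded instruction: one destination
# register (or None) and the set of source registers it reads.
Instr = namedtuple("Instr", ["dest", "srcs"])

def classify_dependence(earlier, later):
    """Return the register dependences that 'later' has on 'earlier'."""
    deps = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        deps.add("RAW")                     # true dependence
    if later.dest is not None and later.dest in earlier.srcs:
        deps.add("WAR")                     # anti (false) dependence
    if later.dest is not None and later.dest == earlier.dest:
        deps.add("WAW")                     # output dependence
    return deps

# add $4, $5, $6  followed by  sub $2, $3, $4  gives a RAW dependence on $4:
print(classify_dependence(Instr("$4", {"$5", "$6"}), Instr("$2", {"$3", "$4"})))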

Dynamic Dependency Graph
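
Roughly, the idea is that each dynamically executed instruction becomes a node, each dependence becomes an edge, and the available parallelism is the number of instructions divided by the length of the longest dependence chain (the critical path). The following Python sketch of that measurement is illustrative only, not the paper's implementation:

def parallelism(num_instructions, edges):
    """Estimate available parallelism from a dynamic dependency graph.

    Instructions are numbered 0..num_instructions-1 in trace order;
    edges is an iterable of (src, dst) pairs with src < dst, one per dependence.
    """
    preds = [[] for _ in range(num_instructions)]
    for src, dst in edges:
        preds[dst].append(src)
    # depth[i] = longest dependence chain (in instructions) ending at i;
    # trace order is already a topological order because edges point forwards.
    depth = [1] * num_instructions
    for i in range(num_instructions):
        for p in preds[i]:
            depth[i] = max(depth[i], depth[p] + 1)
    critical_path = max(depth, default=1)
    return num_instructions / critical_path

# Four instructions: 1 and 2 both depend on 0, and 3 depends on 1.
print(parallelism(4, [(0, 1), (0, 2), (1, 3)]))   # 4 / 3 ≈ 1.33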

Implementation
Benchmarks (mostly MiBench) → gcc + μClibc → MIPS executables → QEMU → instruction traces → DDG builder → dynamic dependency graphs

Effects of control dependencies (results figure)

Effects of control dependencies
- Control dependencies restrict parallelism to within a (dynamic) basic block (modelled in the sketch below)
- Parallelism is <10 in most cases – already exploited in multiple-issue processors
- Good news #1: good branch prediction is not difficult, but it only applies locally, examining at most tens of instructions in advance
- Good news #2: control-flow merge points are not considered here; e.g. in "if R1 then { R2 } else { R3 }; R4", region R4 after the merge point need not depend on the branch. Static analysis would help us remove such dependencies.
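
One simple way to model this restriction (an illustrative sketch, not the paper's exact method) is to add a control edge from every instruction to the most recent preceding branch, so dependence chains, and hence parallelism, are confined to dynamic basic blocks; these edges can be fed straight into the parallelism() sketch above:

def control_edges(is_branch):
    """is_branch[i] is True if instruction i of the trace is a branch.

    Returns control-dependence edges (branch, later_instruction) that make
    every instruction depend on the most recent preceding branch.
    """
    edges, last_branch = [], None
    for i, branch in enumerate(is_branch):
        if last_branch is not None:
            edges.append((last_branch, i))
        if branch:
            last_branch = i
    return edges

# A 6-instruction trace with a branch at position 2: instructions 3..5 all
# gain a control edge from instruction 2.
print(control_edges([False, False, True, False, False, False]))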

True dependencies only
- Control dependencies can be speculated away
- Some name dependencies are compiler artifacts, caused by memory being reused for unrelated calculations
- True dependencies represent the essence of the algorithm
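
Renaming illustrates why name dependencies can be discounted: if every write is given a fresh name (as register-renaming hardware or an SSA-style compiler would do), WAR and WAW edges disappear and only RAW edges remain. A hypothetical Python sketch:

def rename(trace):
    """trace: list of (dest, srcs) register tuples in program order.

    Returns an SSA-like renaming in which every write creates a fresh
    version of its register, so only true (RAW) dependences survive.
    """
    version = {}                                    # current version per register
    renamed = []
    for dest, srcs in trace:
        new_srcs = [f"{r}.{version.get(r, 0)}" for r in srcs]
        version[dest] = version.get(dest, 0) + 1
        renamed.append((f"{dest}.{version[dest]}", new_srcs))
    return renamed

# add $4,$5,$6 ; sub $4,$2,$3 -- the WAW dependence on $4 vanishes once the
# two writes target the distinct names $4.1 and $4.2.
print(rename([("$4", ["$5", "$6"]), ("$4", ["$2", "$3"])]))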

True dependencies only (results figure)

Spaghetti stack – removing more compiler artifacts
Some dependencies on the execution stack are compiler-induced:
- Inter-frame name dependencies
- True dependencies on the stack pointer

void main() { foo(); bar(); }

jal foo             # main: call foo()
addiu $sp,$sp,-32   # foo: decrement stack pointer (new frame)
addu $fp,$0,$sp     # copy stack pointer to frame pointer
... <code for foo()> ...
addu $sp,$0,$fp     # copy frame pointer to stack pointer
addiu $sp,$sp,32    # increment stack pointer (discard frame)
jr $ra              # return to main()
jal bar             # main: call bar()
addiu $sp,$sp,-32   # bar: decrement stack pointer (new frame)
addu $fp,$0,$sp     # copy stack pointer to frame pointer
... <code for bar()> ...
addu $sp,$0,$fp     # copy frame pointer to stack pointer
addiu $sp,$sp,32    # increment stack pointer (discard frame)
jr $ra              # return to main()
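
The contrast can be sketched as follows (illustrative Python with made-up addresses, not the paper's implementation): with a linear stack every call and return updates one shared stack pointer, chaining otherwise independent calls such as foo() and bar() together, whereas a spaghetti (cactus) stack gives each frame a fresh allocation that merely links back to its caller:

class LinearStack:
    """Conventional downward-growing stack: one shared stack pointer."""
    def __init__(self):
        self.sp = 0x7FFF0000             # hypothetical initial stack pointer
    def alloc_frame(self, size):
        self.sp -= size                  # true dependence on the previous sp
        return self.sp
    def free_frame(self, size):
        self.sp += size                  # and another on the updated sp

class SpaghettiStack:
    """Each frame is allocated independently and only records its caller."""
    def alloc_frame(self, size, caller=None):
        return {"storage": bytearray(size), "caller": caller}
    def free_frame(self, frame):
        pass                             # reclaimed independently (e.g. by a GC)

# With SpaghettiStack, the frames of foo() and bar() share neither storage
# nor a serially updated pointer, so their stack accesses need not be ordered.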

Spaghetti stack – removing more compiler artifacts (diagram contrasting a linear stack with a spaghetti stack)

Spaghetti stack – removing more compiler artifacts (diagram of successive alloc frame / free frame steps and their stack-pointer updates)

Spaghetti stack (results figure)

What about other compiler artifacts?
- The stack pointer is just one example; calls to malloc() are another
- Extreme case: remove all address-calculation nodes from the graph

Ignoring all address calculations (results figure)

Conclusions
- Control dependencies are the biggest obstacle to getting parallelism above 10 → control speculation
- Most programs exhibit parallelism >100 when only true dependencies (the essence of the algorithm) are considered
- The spaghetti stack removes certain compiler-induced true dependencies, further doubling the parallelism in some cases
- Good figures, but realising such parallelism remains a challenge

Future work
- Scale up the analysis framework to bigger, more complex benchmarks (e.g. web/DB servers)
- How does parallelism change as the data input size grows?
- How much of the parallelism is instruction-level (ILP), and how much is task-level (TLP)?
- Map dependencies back to source code
A paper addressing some of these questions has just been submitted.

Related work
- Wall, "Limits of instruction-level parallelism" (1991)
- Lam and Wilson, "Limits of control flow on parallelism" (1992)
- Austin and Sohi, "Dynamic dependency analysis of ordinary programs" (1992)
- Postiff, Greene, Tyson and Mudge, "The limits of instruction level parallelism in SPEC95 applications" (1999)
- Stefanović and Martonosi, "Limits and graph structure of available instruction-level parallelism" (2001)