HARP Control Divergence & Assignment 4

HARP Control Divergence & Assignment 4 Blaise Tine Georgia Institute of Technology

Agenda
- Harp Control Divergence
  - Predication
  - Split-Join
- Assignment 4
  - Codebase
  - Clone
  - Barriers
  - Samples Walkthrough
- Questions?

Control Divergence
Two techniques supported by the ISA:
- Predication: control branch divergence at instruction granularity
- Split-Join: control branch divergence at block granularity

Harp Predication
Full predication
- All instructions can be predicated
Implementation
- Separate predicate register file
- All predicated instructions execute: Fetch => Decode => Execute
- Conditional commit stage: only instructions with predicate value 'true' commit

Harp Predication (2)
Compiler support
- If-conversion: converts control dependencies into data dependencies
Example
    if (r1) { ++r2; } else { --r2; }

    rtop @p0, %r1              # set predicate
    @p0 ? addi %r2, %r2, #1
    ntop @p0, @p0              # inverse predicate
    @p0 ? subi %r2, %r2, #1
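The if-converted sequence above can be sketched in C++ to show why no branch is needed: both arithmetic results are computed in every lane, and the predicate only decides which one is committed. This is a minimal sketch, not the emulator's code. The `Lane` struct, `commit_if`, `run_if_converted`, and the 4-lane warp are hypothetical names chosen for illustration, and `rtop` is assumed to set the predicate when the register is nonzero.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model: one warp of 4 lanes, each with its own r1, r2 and p0.
constexpr std::size_t kLanes = 4;

struct Lane {
    std::int32_t r1;
    std::int32_t r2;
    bool p0;
};

// Models the conditional commit stage: the value has already been computed,
// but it is written back only when the guarding predicate is true.
inline void commit_if(bool pred, std::int32_t& dst, std::int32_t value) {
    if (pred) dst = value;
}

// If-converted form of: if (r1) { ++r2; } else { --r2; }
inline void run_if_converted(std::array<Lane, kLanes>& warp) {
    for (Lane& lane : warp) {
        lane.p0 = (lane.r1 != 0);                  // rtop @p0, %r1 (assumed: true when nonzero)
        commit_if(lane.p0, lane.r2, lane.r2 + 1);  // @p0 ? addi %r2, %r2, #1
        lane.p0 = !lane.p0;                        // ntop @p0, @p0
        commit_if(lane.p0, lane.r2, lane.r2 - 1);  // @p0 ? subi %r2, %r2, #1
    }
}
```

Every lane runs all four instructions regardless of its predicate, which is the cost noted under the limitations of predication.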

Harp Predication (3)
Predicate Value Test Instructions
    rtop @dst %src
    isneg @dst %src
    iszero @dst %src
Predicate Manipulation Instructions
    ntop @dst @src0
    andp @dst @src0 @src1
    orp @dst @src0 @src1
    xorp @dst @src0 @src1
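The slide lists the mnemonics without spelling out their semantics. The sketch below shows one plausible reading, inferred from the names only: the test instructions derive a predicate from a general-purpose register and the manipulation instructions are Boolean operations on predicates. Treat every function body here as an assumption. `is_positive` is a hypothetical helper showing how the instructions might be combined.

```cpp
#include <cstdint>

// Assumed semantics, inferred from the mnemonics only.

// Predicate value tests: general-purpose register -> predicate.
inline bool rtop(std::int32_t src)   { return src != 0; }  // register to predicate
inline bool isneg(std::int32_t src)  { return src < 0; }
inline bool iszero(std::int32_t src) { return src == 0; }

// Predicate manipulation: predicate(s) -> predicate.
inline bool ntop(bool src0)            { return !src0; }
inline bool andp(bool src0, bool src1) { return src0 && src1; }
inline bool orp(bool src0, bool src1)  { return src0 || src1; }
inline bool xorp(bool src0, bool src1) { return src0 != src1; }

// Hypothetical composition: predicate for "r1 > 0", built as
// not (negative or zero).
inline bool is_positive(std::int32_t r1) {
    const bool p0 = isneg(r1);
    const bool p1 = iszero(r1);
    return ntop(orp(p0, p1));
}
```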

Harp Predication (4)
Advantages
- No branching overhead
- Simple microarchitecture
Limitations
- If-conversion is not always possible, e.g. loops, indirect branches
- Inefficient with unanimous branches: both paths are always executed

Harp Split-Join
ISA support
- @p split: partition a warp using the predicate mask, each subset taking a different target
- join: merge the partitioned subsets into a single execution block
Implementation
- Hardware stack management
- Compiler support

Harp Split-Join (2)
Example
    if (r1) { ++r2; } else { --r2; }

    rtop @p0, %r1              # set predicate
    @p0 ? split
    @p0 ? jmp then
    subi %r2, %r2, #1
    jmp next
    then: addi %r2, %r2, #1
    next: join

Execution steps (HW stack entries are NPC, mask pairs, top of stack first):
1. Set predicate. HW stack: empty
2. Push PC and mask onto HW stack. HW stack: (@2, 1001), (@7, 0110)
3. Execute threads with 'true' predicate. HW stack: (@2, 1001), (@7, 0110)
4. Pop HW stack and jmp to @2. HW stack: (@7, 0110)
5. Execute threads with 'false' predicate. HW stack: (@7, 0110)
6. Pop HW stack and jmp to @7. HW stack: empty
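The stack walkthrough above can be approximated with a small C++ model. This is a sketch under stated assumptions, not the Harp hardware: `StackEntry`, `WarpState`, `split`, and `join` are hypothetical names, and the reconvergence PC is passed in as a parameter because the slides do not say how the hardware obtains it. One deliberate difference from the slides: the reconvergence entry here stores the mask that was active before the split (so all lanes resume after the final pop), whereas the slides show 0110 for the @7 entry.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct StackEntry {
    std::uint32_t npc;   // PC to resume at when this entry is popped
    std::uint32_t mask;  // lanes to activate when this entry is popped
};

struct WarpState {
    std::uint32_t pc = 0;
    std::uint32_t active = 0;       // one bit per lane
    std::vector<StackEntry> stack;  // models the HW stack, back() is the top
};

// split: lanes with a true predicate keep running, the rest are parked.
// `reconverge_pc` (assumed input) is the instruction after the matching join.
inline void split(WarpState& w, std::uint32_t pred, std::uint32_t reconverge_pc) {
    const std::uint32_t taken = w.active & pred;
    const std::uint32_t not_taken = w.active & ~pred;
    w.stack.push_back({reconverge_pc, w.active});  // popped last: modelling choice, restores pre-split mask
    w.stack.push_back({w.pc + 1, not_taken});      // popped first: the 'false' lanes restart after split
    w.active = taken;
    w.pc += 1;
}

// join: pop the top entry, then continue at its PC with its mask.
inline void join(WarpState& w) {
    assert(!w.stack.empty());
    const StackEntry top = w.stack.back();
    w.stack.pop_back();
    w.pc = top.npc;
    w.active = top.mask;
}
```

With a 4-lane warp at the `split` instruction (PC 1, active mask 1111) and predicate 0110, the first `join` resumes lanes 1001 at @2 and the second `join` continues at @7, matching the order of the steps listed above.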

Harp Split-Join (2)
Advantages
- Efficient with unanimous branches: only a single path is executed
- The active mask turns off inactive threads
Challenges
- Complex microarchitecture: HW stack manager
- Split-jmp-Join overhead

Morgan Kaufmann Publishers April 3, 2019
Assignment 4: Mini Harp
Minimal ISA
- Word encoding
- Integers only
- A single predicate register
- No Split-Join
- No warp creation
- No interrupts
- No virtual addressing
Instruction set
- Nop, Add, Sub, And, Or, Xor, Not, Shr, Shl, Ld, St, Jmp, Jal, Bar
Configuration
- Register size, warp size, number of warps
Chapter 1 — Computer Abstractions and Technology

Assignment 4: Code base
Shared header
    Common.h         // common includes and definitions
Utility library
    utils.cpp/h      // utility functions
Core classes
    mem.cpp/h        // memory
    lrucache.cpp/h   // cache
    Instr.cpp/h      // instruction
    decode.cpp/h     // decoder
    regfile.h        // register file
    warp.cpp/h       // warp unit
    core.cpp/h       // processor core

Assignment 4: Core Initialization
Code callouts:
- Program RAM
- Core construction
- Console output
- Load/Store unit
- ICache & DCache
- IDecoder
- Warps

Assignment 4: Memory Layout
Diagram regions:
- console
- RAM

Assignment 4: Warp Initialization
Code callouts:
- Warp construction
- GP registers
- Pred registers
- Boot enable

Assignment 4: Warp Execute
Code callouts:
- Step function
- Pipeline stages
- Fetch
- Decode

Assignment 4: Warp Execute (2)
Code callouts:
- Execution
- Instructions
- Predication
- Jump instruction
- Set predicate
- Add your code!

Assignment 4: Clone Instruction
Format
    clone %src0
Operation
- Copy the current lane's registers into the %src0 lane.
- Register %src0 holds the destination lane index.
e.g.
    ldi %r0, #2
    clone %r0        # copy current registers into 3rd lane
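A minimal C++ sketch of the clone operation described above, assuming a register file indexed as `regs[lane][register]`. `WarpRegs` and the free function `clone` are hypothetical and are not taken from the assignment's `regfile.h` or `warp.cpp`. The bounds check is also an assumption, since the slide does not say what happens for an out-of-range lane index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = std::uint32_t;

// Hypothetical per-warp register file: regs[lane][register index].
struct WarpRegs {
    std::vector<std::vector<Word>> regs;

    WarpRegs(std::size_t lanes, std::size_t num_regs)
        : regs(lanes, std::vector<Word>(num_regs, 0)) {}
};

// clone %src0, executed by lane `cur_lane`:
// register src0 holds the destination lane index, and all registers of the
// current lane are copied into that lane.
inline void clone(WarpRegs& w, std::size_t cur_lane, std::size_t src0) {
    const std::size_t dst_lane = w.regs[cur_lane][src0];
    assert(dst_lane < w.regs.size());  // assumed: out-of-range index is a program error
    w.regs[dst_lane] = w.regs[cur_lane];
}
```

For the slide's example, lane 0 loads 2 into `%r0` and then clones, so lane index 2 (the 3rd lane) receives a copy of lane 0's registers, including `%r0` itself.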

Assignment 4: Barrier Instruction
Format
    bar %src0, %src1
Operation
- Synchronize %src1 number of warps with barrier identifier %src0.
- Register %src0 holds the barrier id (supported max value is 3).
- Register %src1 holds the number of warps to wait on.
e.g.
    ldi %r0, #1
    ldi %r1, #2
    bar %r0, %r1     # insert a size-2 named barrier with id=1
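One way to model the named barrier in C++ is a small table of waiting lists, one per barrier id. This is a sketch under assumptions, not the assignment's reference solution: `BarrierTable` and `arrive` are hypothetical names, warps are identified by an index, and a barrier is assumed to reset once the requested number of warps has arrived.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical barrier bookkeeping for `bar %src0, %src1`.
class BarrierTable {
public:
    // Called when warp `warp` executes bar with id `id` and size `count`.
    // Returns the warps to resume: empty while the barrier is still filling,
    // otherwise every warp that was waiting (including the caller).
    std::vector<std::size_t> arrive(std::size_t id, std::size_t count, std::size_t warp) {
        assert(id < slots_.size());  // supported max barrier id is 3
        std::vector<std::size_t>& waiting = slots_[id];
        waiting.push_back(warp);
        if (waiting.size() < count) {
            return {};  // caller stalls until enough warps arrive
        }
        std::vector<std::size_t> released;
        released.swap(waiting);  // assumed: barrier resets for reuse
        return released;
    }

private:
    std::array<std::vector<std::size_t>, 4> slots_;  // barrier ids 0..3
};
```

In an emulator loop, a warp that gets an empty result would be marked inactive, and the warps in a non-empty result would be marked runnable again.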

Assignment 4: Testing
Emulator command line
    ./miniharp.out -r #regs -t #threads -w #warps -o #output
Sample programs
    $ ./miniharp.out hello.bin -t 4 -w 1 -r 8 -o output.log
    $ ./miniharp.out sum.bin -t 4 -w 1 -r 8 -o output.log
    $ ./miniharp.out barrier.bin -t 4 -w 4 -r 8 -o output.log
Output format
    "<Program Output>"
    "Instruction Count: <?>"

Assignment 4: runtime.s
Code callouts:
- Print Hex
- Print String
- Print NewLine

Assignment 4: hello.s
Code callouts:
- Load string
- Call prints
- Exit
- String data

Assignment 4: sum.s
Code callouts:
- Clone Registers
- Parallel Call
- Print result0
- Array data
- Output address

Assignment 4: barrier.s
Code callouts:
- Start new Warp
- Barrier
- Single warp
- Print results

Questions?