HARP Control Divergence & Assignment 4


Blaise Tine
Georgia Institute of Technology

Agenda
  Harp Control Divergence
    Predication
    Split-Join
  Assignment 4
    Codebase
    Clone
    Barriers
    Samples Walkthrough
  Questions?

Control Divergence
  Two techniques supported by the ISA:
    Predication: controls branch divergence at instruction granularity
    Split-Join: controls branch divergence at block granularity

Harp Predication
  Full predication
    All instructions can be predicated
  Implementation
    Separate predicate register file
    All predicated instructions execute: Fetch => Decode => Execute
    Conditional commit stage: only instructions with predicate value 'true' commit

Harp Predication (2)
  Compiler support
    If-conversion: converts control dependencies into data dependencies
  Example
    if (r1) { ++r2; } else { --r2; }

    rtop @p0, %r1              # set predicate
    @p0 ? addi %r2, %r2, #1
    ntop @p0, @p0              # inverse predicate
    @p0 ? subi %r2, %r2, #1
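The if-converted sequence above can be emulated lane by lane. This is a minimal sketch, not the assignment's implementation: the helper names `commit` and `predicated_if_else` are made up for illustration, and it assumes that `rtop` sets the predicate when the source register is non-zero. Every instruction is computed on all lanes, and the result is written back only where the predicate is true.

```python
def commit(old, new, pred):
    """Conditional commit: keep the new value only on lanes whose predicate is true."""
    return [n if p else o for o, n, p in zip(old, new, pred)]


def predicated_if_else(r1, r2):
    """Emulate the if-converted example; r1 and r2 hold one value per lane."""
    # rtop @p0, %r1 (assumption: predicate is true when the register is non-zero)
    p0 = [v != 0 for v in r1]
    # @p0 ? addi %r2, %r2, #1 (executes on every lane, commits where p0 is true)
    r2 = commit(r2, [x + 1 for x in r2], p0)
    # ntop @p0, @p0 (invert the predicate)
    p0 = [not p for p in p0]
    # @p0 ? subi %r2, %r2, #1
    r2 = commit(r2, [x - 1 for x in r2], p0)
    return r2


print(predicated_if_else([0, 1, 1, 0], [10, 10, 10, 10]))  # [9, 11, 11, 9]
```

Both the `addi` and the `subi` are computed for every lane regardless of the predicate, which is why predication pays for both paths even when all lanes agree.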

Harp Predication (3)
  Predicate value test instructions
    rtop @dst %src
    isneg @dst %src
    iszero @dst %src
  Predicate manipulation instructions
    ntop @dst @src0
    andp @dst @src0 @src1
    orp @dst @src0 @src1
    xorp @dst @src0 @src1
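As a rough illustration of how these instructions compose, the sketch below models each one as a function over per-lane values. The semantics are inferred from the mnemonics and are assumptions, not taken from the HARP specification: `rtop` is true for non-zero, `isneg` for negative, `iszero` for zero, and the manipulation instructions are plain boolean operations.

```python
# Predicate value tests: register values in, one predicate bit per lane out.
def rtop(src):
    return [v != 0 for v in src]     # assumed: true when the register is non-zero


def isneg(src):
    return [v < 0 for v in src]      # assumed: true when the register is negative


def iszero(src):
    return [v == 0 for v in src]     # assumed: true when the register is zero


# Predicate manipulation: predicate bits in, predicate bits out.
def ntop(p):
    return [not b for b in p]


def andp(a, b):
    return [x and y for x, y in zip(a, b)]


def orp(a, b):
    return [x or y for x, y in zip(a, b)]


def xorp(a, b):
    return [x != y for x, y in zip(a, b)]


# Build "r1 > 0" from the primitives: not negative AND not zero.
r1 = [-2, 0, 3, 7]
is_positive = andp(ntop(isneg(r1)), ntop(iszero(r1)))
print(is_positive)  # [False, False, True, True]
```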

Harp Predication (4)
  Advantages
    No branching overhead
    Simple microarchitecture
  Limitations
    If-conversion is not always possible, e.g. loops, indirect branches
    Inefficient with unanimous branches: both paths are always executed

Harp Split-Join
  ISA support
    @p ? split: partition a warp using the predicate mask, each subset taking a different target
    join: merge the partitioned subsets into a single execution block
  Implementation
    Hardware stack management
    Compiler support

Harp Split-Join (2)
  Example
    if (r1) { ++r2; } else { --r2; }

    rtop @p0, %r1              # set predicate
    @p0 ? split
    @p0 ? jmp then
    subi %r2, %r2, #1
    jmp next
    then: addi %r2, %r2, #1
    next: join

  Execution steps (HW stack entries shown as NPC, mask)
    1. Set predicate.
    2. Push PC and mask onto the HW stack. Stack: (@2, 1001), (@7, 0110)
    3. Execute threads with 'true' predicate.
    4. Pop HW stack and jmp to @2. Stack: (@7, 0110)
    5. Execute threads with 'false' predicate.
    6. Pop HW stack and jmp to @7. Stack: empty
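The walkthrough can be approximated with a small emulator for this one program. It is a sketch under stated assumptions rather than the HARP microarchitecture: the function name `run_split_join` is hypothetical, `rtop` is assumed to test for non-zero, `split` is assumed to push a "resume after join" entry (represented by `None` instead of an explicit NPC) plus an entry that restarts the false lanes right after the `split`, and a unanimous branch is assumed to skip the second push. The mask values on the slide are not reproduced exactly.

```python
def run_split_join(r1, r2):
    """Run the split/join example on one warp; r1 and r2 hold one value per lane."""
    n = len(r1)
    r2 = list(r2)
    pred = [False] * n          # predicate register @p0, one bit per lane
    active = [True] * n         # active thread mask
    stack = []                  # HW stack entries: (npc, mask)
    trace = []                  # program counters in execution order
    THEN, NEXT, END = 5, 6, 7   # instruction indices of the labels
    pc = 0
    while pc < END:
        trace.append(pc)
        npc = pc + 1
        guarded = [a and p for a, p in zip(active, pred)]
        if pc == 0:      # rtop @p0, %r1
            pred = [v != 0 for v in r1]
        elif pc == 1:    # @p0 ? split
            others = [a and not p for a, p in zip(active, pred)]
            stack.append((None, active))       # None: resume after the join
            if any(guarded) and any(others):   # divergent branch only
                stack.append((npc, others))    # false lanes restart after split
                active = guarded
        elif pc == 2:    # @p0 ? jmp then
            if any(guarded):
                npc = THEN
        elif pc == 3:    # subi %r2, %r2, #1
            r2 = [x - 1 if a else x for x, a in zip(r2, active)]
        elif pc == 4:    # jmp next
            npc = NEXT
        elif pc == 5:    # then: addi %r2, %r2, #1
            r2 = [x + 1 if a else x for x, a in zip(r2, active)]
        elif pc == 6:    # next: join
            target, active = stack.pop()
            if target is not None:
                npc = target
        pc = npc
    return r2, trace


# Divergent warp: the true lanes run the then-path, then the false lanes run the else-path.
print(run_split_join([0, 1, 1, 0], [10, 10, 10, 10]))
# ([9, 11, 11, 9], [0, 1, 2, 5, 6, 2, 3, 4, 6])

# Unanimous warp: only a single path is executed.
print(run_split_join([1, 1, 1, 1], [10, 10, 10, 10]))
# ([11, 11, 11, 11], [0, 1, 2, 5, 6])
```

The second trace is shorter because no lane takes the else-path, which is the case where split-join avoids the wasted work that predication would incur.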

Harp Split-Join (3)
  Advantages
    Efficient with unanimous branches: only a single path is executed
    The active mask turns off inactive threads
  Challenges
    Complex microarchitecture: HW stack manager
    Split-jmp-Join overhead

Morgan Kaufmann Publishers, April 3, 2019
Chapter 1 — Computer Abstractions and Technology

Assignment 4: Mini Harp
  Minimal ISA
    Word encoding
    Integers only
    A single predicate register
    No Split-Join
    No warp creation
    No interrupts
    No virtual addressing
  Instruction set
    Nop, Add, Sub, And, Or, Xor, Not, Shr, Shl, Ld, St, Jmp, Jal, Bar
  Configuration
    Register size, warp size, number of warps

Assignment 4: Code base
  Shared header
    Common.h          // common includes and definitions
  Utility library
    utils.cpp/h       // utility functions
  Core classes
    mem.cpp/h         // memory
    lrucache.cpp/h    // cache
    Instr.cpp/h       // instruction
    decode.cpp/h      // decoder
    regfile.h         // register file
    warp.cpp/h        // warp unit
    core.cpp/h        // processor core

Assignment 4: Core Initialization
  Program RAM
  Core construction
  Console output
  Load/Store unit
  ICache & DCache
  IDecoder
  Warps

Assignment 4: Memory Layout
  Console
  RAM

Assignment 4: Warp Initialization
  Warp construction
  GP registers
  Pred registers
  Boot enable

Assignment 4: Warp Execute
  Step function
  Pipeline stages
  Fetch
  Decode

Assignment 4: Warp Execute (2)
  Execution
  Instructions
  Predication
  Jump instruction
  Set predicate
  Add your code!

Assignment 4: Clone Instruction
  Format
    clone %src0
  Operation
    Copy the current lane's registers into the %src0 lane.
    Register %src0 holds the destination lane index.
  Example
    ldi %r0, #2
    clone %r0                  # copy current registers into 3rd lane
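One way to picture the clone operation is as a copy between rows of a per-lane register table. The sketch below is illustrative only: the function name `clone`, the `regs` layout (one list of registers per lane), and the range check are assumptions and are not taken from the assignment code base.

```python
def clone(regs, current_lane, src0):
    """Copy the current lane's registers into the lane whose index is held in register src0.

    regs is a list with one register list per lane (hypothetical layout).
    """
    dst_lane = regs[current_lane][src0]
    if not 0 <= dst_lane < len(regs):
        raise ValueError(f"destination lane {dst_lane} is out of range")
    regs[dst_lane] = list(regs[current_lane])   # copy, so the lanes stay independent


# Four lanes with four registers each; lane 0 is the current lane.
regs = [[0, 0, 0, 0] for _ in range(4)]
regs[0] = [2, 7, 8, 9]          # ldi %r0, #2 has already run, r1..r3 hold other data
clone(regs, current_lane=0, src0=0)             # clone %r0
print(regs[2])  # [2, 7, 8, 9]
```

Lane index 2 is the third lane, matching the comment in the slide's example.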

Assignment 4: Barrier Instruction
  Format
    bar %src0, %src1
  Operation
    Synchronize %src1 number of warps with barrier identifier %src0.
    Register %src0 holds the barrier id (supported max value is 3).
    Register %src1 holds the number of warps to wait on.
  Example
    ldi %r0, #1
    ldi %r1, #2
    bar %r0, %r1               # insert a size-2 named barrier with id=1
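A named barrier of this kind can be sketched as a small table that records which warps are stalled on each id and releases them once enough have arrived. The class and method names (`BarrierTable`, `arrive`) are hypothetical, and the release policy shown (release everyone when the count is reached, then reset the barrier for reuse) is an assumption rather than the required behavior.

```python
MAX_BARRIER_ID = 3   # from the slide: supported max barrier id is 3


class BarrierTable:
    """Tracks, per barrier id, the set of warps currently stalled on it."""

    def __init__(self):
        self.waiting = {bid: set() for bid in range(MAX_BARRIER_ID + 1)}

    def arrive(self, barrier_id, num_warps, warp_id):
        """Register a warp at a barrier; return the warps released (empty while waiting)."""
        if not 0 <= barrier_id <= MAX_BARRIER_ID:
            raise ValueError(f"unsupported barrier id {barrier_id}")
        stalled = self.waiting[barrier_id]
        stalled.add(warp_id)
        if len(stalled) < num_warps:
            return []                 # keep this warp stalled
        released = sorted(stalled)
        stalled.clear()               # barrier can be reused afterwards
        return released


table = BarrierTable()
print(table.arrive(barrier_id=1, num_warps=2, warp_id=0))  # []
print(table.arrive(barrier_id=1, num_warps=2, warp_id=3))  # [0, 3]
```

In an emulator loop, a warp that receives an empty list would be marked inactive until a later `arrive` call returns its id.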

Assignment 4: Testing
  Emulator command line
    ./miniharp.out -r #regs -t #threads -w #warps -o #output
  Sample programs
    $ ./miniharp.out hello.bin -t 4 -w 1 -r 8 -o output.log
    $ ./miniharp.out sum.bin -t 4 -w 1 -r 8 -o output.log
    $ ./miniharp.out barrier.bin -t 4 -w 4 -r 8 -o output.log
  Output format
    "<Program Output>"
    "Instruction Count: <?>"

Assignment 4: runtime.s
  Print Hex
  Print String
  Print NewLine

Assignment 4: hello.s
  Load string
  Call prints
  Exit
  String data

Assignment 4: sum.s
  Clone registers
  Parallel call
  Print result0
  Array data
  Output address

Assignment 4: barrier.s
  Start new warp
  Barrier
  Single warp
  Print results

Questions?