A Multithreaded Architecture
Krishna Kavi (with minor modifications)
Computer Systems Research at UNT

Presentation transcript:

Slide 1: A Multithreaded Architecture
Krishna Kavi (with minor modifications)

Slide 2: Billion Transistor Chips
How can the silicon real estate be used for improved performance?
- More CPUs per chip: multi-core systems
- More threads per core: hyper-threading
- More cache and more cache levels (L1, L2, L3)
- System on a chip and network on chip
- Hybrid systems including reconfigurable logic

Slide 3: Areas of Research
- Multithreaded computer architecture
- Memory systems and cache memory optimizations
- Low-power optimizations
- Compiler optimizations
- Parallel and distributed processing

Slide 4: Our Solution
How can the silicon real estate be used for improved performance?
- LOTS of simple processors on a chip
- Scheduled Dataflow (SDF)
- Based upon dataflow principles
- Supports a high degree of multithreading
- Separates memory access from the execution pipeline
- "Scales" well in performance

Slide 5: Computer Architecture Research
A new multithreaded architecture called Scheduled Dataflow (SDF):
- Uses a non-blocking multithreaded model
- Decouples memory access from the execution pipelines
- Uses an in-order execution model (less hardware complexity)
The simpler hardware of SDF may lend itself better to embedded applications with stringent power requirements.

Slide 6: Overview of Our Multithreaded SDF
- Based on our past work with dataflow and functional architectures
- Non-blocking multithreaded architecture
- Contains multiple functional units, like superscalar and other multithreaded systems
- Contains multiple register contexts, like other multithreaded systems
- Decoupled access-execute architecture
- Completely separates memory accesses from the execution pipeline

Slide 7: Background
How does a program run on a typical computer?
- A program is translated into machine (or assembly) language
- The program instructions and data are stored in memory (DRAM)
- The program is then executed by "fetching" one instruction at a time
- The instruction to be fetched is selected by a special pointer called the program counter
- If an instruction is a branch or jump, the program counter is changed to the address of the branch target
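
The fetch-execute cycle described on this slide can be sketched as a tiny interpreter. This is an illustrative toy, not a model of any real instruction set: the opcodes (`LOADI`, `ADD`, `JUMP_IF_ZERO`, `JUMP`, `HALT`) and the function name `run` are assumptions made for the example.

```python
def run(program):
    """Execute a toy program and return the registers once HALT is reached."""
    regs = {}
    pc = 0  # program counter: index of the next instruction to fetch
    while True:
        op, *args = program[pc]      # fetch one instruction
        pc += 1                      # default: continue with the next instruction
        if op == "LOADI":            # LOADI rd, constant
            regs[args[0]] = args[1]
        elif op == "ADD":            # ADD rd, rs, rt
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "JUMP_IF_ZERO":   # branch: overwrite the program counter
            if regs[args[0]] == 0:
                pc = args[1]
        elif op == "JUMP":           # unconditional jump
            pc = args[0]
        elif op == "HALT":
            return regs
        else:
            raise ValueError(f"unknown opcode {op}")


# Hypothetical program: add 3 + 2 + 1 using a loop.
program = [
    ("LOADI", "r1", 3),          # 0: loop counter
    ("LOADI", "r2", 0),          # 1: accumulator
    ("LOADI", "r3", -1),         # 2: decrement
    ("JUMP_IF_ZERO", "r1", 7),   # 3: leave the loop when the counter is zero
    ("ADD", "r2", "r2", "r1"),   # 4: accumulator += counter
    ("ADD", "r1", "r1", "r3"),   # 5: counter -= 1
    ("JUMP", 3),                 # 6: back to the loop test
    ("HALT",),                   # 7
]
result = run(program)
```

The only state that decides which instruction runs next is `pc`, which is the property the dataflow model on the following slide removes.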

Slide 8: Dataflow Model

MIPS-like instructions:
     1. LOAD  R2, A           / load A into R2
     2. LOAD  R3, B           / load B into R3
     3. ADD   R11, R2, R3     / R11 = A+B
     4. LOAD  R4, X           / load X into R4
     5. LOAD  R5, Y           / load Y into R5
     6. ADD   R10, R4, R5     / R10 = X+Y
     7. SUB   R12, R4, R5     / R12 = X-Y
     8. MULT  R14, R10, R11   / R14 = (X+Y)*(A+B)
     9. DIV   R15, R12, R11   / R15 = (X-Y)/(A+B)
    10. STORE ...., R14       / store first result
    11. STORE ....., R15      / store second result

Pure dataflow instructions:
     1: LOAD  3L        / load A, send to Instruction 3
     2: LOAD  3R        / load B, send to Instruction 3
     3: ADD   8R, 9R    / A+B, send to Instructions 8 and 9
     4: LOAD  6L, 7L    / load X, send to Instructions 6 and 7
     5: LOAD  6R, 7R    / load Y, send to Instructions 6 and 7
     6: ADD   8L        / X+Y, send to Instruction 8
     7: SUB   9L        / X-Y, send to Instruction 9
     8: MULT  10L       / (X+Y)*(A+B), send to Instruction 10
     9: DIV   11L       / (X-Y)/(A+B), send to Instruction 11
    10: STORE           / store first result
    11: STORE           / store second result
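
As a rough illustration of how the pure dataflow listing executes, the sketch below fires each instruction as soon as all of its operands have arrived, with no program counter. The graph is transcribed from the slide; the Python representation (`GRAPH`, `ARITY`, `run_dataflow`) and the sample memory values are assumptions made for the example.

```python
import operator

# Instruction number -> (opcode, destinations); a destination is
# (instruction number, operand port), e.g. "3L" on the slide is (3, "L").
GRAPH = {
    1: ("LOAD A", [(3, "L")]),
    2: ("LOAD B", [(3, "R")]),
    3: ("ADD", [(8, "R"), (9, "R")]),
    4: ("LOAD X", [(6, "L"), (7, "L")]),
    5: ("LOAD Y", [(6, "R"), (7, "R")]),
    6: ("ADD", [(8, "L")]),
    7: ("SUB", [(9, "L")]),
    8: ("MULT", [(10, "L")]),
    9: ("DIV", [(11, "L")]),
    10: ("STORE", []),
    11: ("STORE", []),
}
ARITY = {"ADD": 2, "SUB": 2, "MULT": 2, "DIV": 2, "STORE": 1}
OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MULT": operator.mul, "DIV": operator.truediv}


def run_dataflow(memory):
    """Fire enabled instructions until none remain; return stores and firing order."""
    operands = {n: {} for n in GRAPH}   # operand values received so far
    stored = {}
    fired = []
    # LOADs need no operands, so they are enabled from the start.
    ready = [n for n, (op, _) in GRAPH.items() if op.startswith("LOAD")]
    while ready:
        n = ready.pop()                 # any enabled instruction may fire
        op, dests = GRAPH[n]
        fired.append(n)
        if op == "STORE":
            stored[n] = operands[n]["L"]
            continue
        if op.startswith("LOAD"):
            value = memory[op.split()[1]]
        else:
            value = OPS[op](operands[n]["L"], operands[n]["R"])
        for dest, port in dests:        # send the result to each consumer
            operands[dest][port] = value
            if len(operands[dest]) == ARITY[GRAPH[dest][0]]:
                ready.append(dest)      # all inputs present: consumer is enabled
    return stored, fired


stored, fired = run_dataflow({"A": 1, "B": 2, "X": 6, "Y": 3})
```

The firing order in `fired` depends only on data availability, so it need not match the numeric order of the listing.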

Slide 9: SDF Dataflow Model
- We use the dataflow model at the thread level
- Instructions within a thread are executed sequentially
- We also call this the non-blocking thread model

Slide 10: Blocking vs Non-Blocking Thread Models
- Traditional multithreaded systems use blocking models
- A thread can be blocked (or preempted)
- A blocked thread is switched out, and its execution resumes in the future
- In some cases, the resources of a blocked thread (including its register context) may be assigned to other waiting threads
- Blocking models require more context switches
- In a non-blocking model, once a thread begins execution, it will not be stopped (or preempted) before it completes execution

Slide 11: Non-Blocking Threads
- Most functional and dataflow systems use non-blocking threads
- A thread (code block) is enabled when all its inputs are available
- A scheduled thread will run to completion
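
A minimal sketch of the enabling rule, assuming each thread carries a synchronization count equal to the number of inputs it still needs. The class and method names (`Frame`, `Scheduler`, `deliver`) are hypothetical and do not describe the SDF hardware.

```python
from collections import deque


class Frame:
    """A thread instance waiting for inputs (hypothetical structure)."""

    def __init__(self, name, body, sync_count):
        self.name = name
        self.body = body              # function run when the thread is scheduled
        self.sync_count = sync_count  # number of inputs still missing
        self.inputs = {}


class Scheduler:
    def __init__(self):
        self.ready = deque()          # enabled threads, in arrival order
        self.log = []

    def deliver(self, frame, slot, value):
        """Store one input; enable the thread when its last input arrives."""
        frame.inputs[slot] = value
        frame.sync_count -= 1
        if frame.sync_count == 0:
            self.ready.append(frame)

    def run(self):
        while self.ready:
            frame = self.ready.popleft()
            self.log.append(frame.name)
            frame.body(self, frame.inputs)  # runs to completion, never preempted


sched = Scheduler()
results = []
sink = Frame("sink", lambda s, inp: results.append(inp["v"]), sync_count=1)
adder = Frame("adder",
              lambda s, inp: s.deliver(sink, "v", inp["a"] + inp["b"]),
              sync_count=2)

sched.deliver(adder, "a", 4)
enabled_after_one_input = len(sched.ready)  # adder still waits for "b"
sched.deliver(adder, "b", 5)
sched.run()
```

Because a thread starts only when every input is present, its body never has to wait in the middle, which is why no context needs to be saved and restored.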

Slide 12: Cilk Programming Example

    thread fib (cont int k, int n)
    {
        if (n < 2)
            send_argument (k, n);
        else {
            cont int x, y;
            spawn_next sum (k, ?x, ?y);   /* create a successor thread */
            spawn fib (x, n-1);           /* fork a child thread */
            spawn fib (y, n-2);           /* fork a child thread */
        }
    }

    thread sum (cont int k, int x, int y)
    {
        send_argument (k, x+y);           /* return results to parent's successor */
    }
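
To make the continuation-passing structure of the Cilk example concrete, here is a Python stand-in that mimics it with explicit closures. It is not the Cilk runtime: `Closure`, `run_fib`, the `(closure, slot)` encoding of a continuation, and the use of `None` for an empty `?x` slot are all assumptions made for the illustration.

```python
from collections import deque

ready = deque()  # closures whose arguments have all arrived


class Closure:
    """A thread plus its argument slots; None marks a slot still to be filled."""

    def __init__(self, thread, args):
        self.thread = thread
        self.args = list(args)
        self.missing = sum(a is None for a in self.args)
        if self.missing == 0:
            ready.append(self)


def send_argument(cont, value):
    """Fill one slot of the target closure; enable it when nothing is missing."""
    closure, slot = cont
    closure.args[slot] = value
    closure.missing -= 1
    if closure.missing == 0:
        ready.append(closure)


def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        succ = Closure(sum_thread, [k, None, None])  # spawn_next sum(k, ?x, ?y)
        Closure(fib, [(succ, 1), n - 1])             # spawn fib(x, n-1)
        Closure(fib, [(succ, 2), n - 2])             # spawn fib(y, n-2)


def sum_thread(k, x, y):
    send_argument(k, x + y)  # return the result to the parent's successor


def run_fib(n):
    result = []
    root = Closure(lambda v: result.append(v), [None])  # receives the final value
    Closure(fib, [(root, 0), n])
    while ready:                 # each thread runs to completion once enabled
        closure = ready.popleft()
        closure.thread(*closure.args)
    return result[0]
```

Calling `run_fib(10)` drives the whole computation through the ready queue; no thread ever waits for a child, since results flow forward to the successor `sum_thread`.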

Slide 13: Decoupled Architectures
- Separate memory accesses from execution
- A separate processor handles all memory accesses
- The earliest suggestion came from J. E. Smith: the Decoupled Access/Execute (DAE) architecture

Slide 14: Limitations of DAE Architecture
- Designed for systems with no pipelines
- Single instruction stream
- Instructions for the Execute processor must be coordinated with the data accesses performed by the Access processor
- Very tight synchronization is needed
- Coordinating conditional branches complicates the design
- Generating coordinated instruction streams for Execute and Access may prevent traditional compiler optimizations

Slide 15: Our Decoupled Architecture
Multithreading along with decoupling:
- Group all LOAD instructions together at the head of a thread
- Pre-load a thread's data into registers before scheduling it for execution
- During execution, the thread does not access memory
- Post-store the thread's results after it completes execution
- Separate memory accesses into pre-load and post-store threads
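
The three-phase split can be sketched as follows. The example is a simplified assumption, not SDF code: the function names, the `frame` dictionary of addresses, and the tiny sum/difference computation are invented for illustration. The point is that only `preload` and `poststore` receive the `memory` argument.

```python
def preload(memory, frame):
    """Pre-load phase: copy every input the thread needs into registers."""
    return {"x": memory[frame["x_addr"]], "y": memory[frame["y_addr"]]}


def execute(regs):
    """Execute phase: works on registers only, so it never waits on memory."""
    regs["sum"] = regs["x"] + regs["y"]
    regs["diff"] = regs["x"] - regs["y"]
    return regs


def poststore(memory, frame, regs):
    """Post-store phase: write the results back after execution has finished."""
    memory[frame["sum_addr"]] = regs["sum"]
    memory[frame["diff_addr"]] = regs["diff"]


# Hypothetical memory image and frame layout.
memory = {0: 7, 1: 2, 2: None, 3: None}
frame = {"x_addr": 0, "y_addr": 1, "sum_addr": 2, "diff_addr": 3}

regs = execute(preload(memory, frame))
poststore(memory, frame, regs)
```

Since `execute` has no access to `memory`, a cache miss can only delay the pre-load or post-store phase, never the execution pipeline.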

Slide 16: Pre-Load and Post-Store

    Conventional              New Architecture
    LD    F0, 0(R1)           LD    F0, 0(R1)
    LD    F6, -8(R1)          LD    F6, -8(R1)
    MULTD F0, F0, F2          LD    F4, 0(R2)
    MULTD F6, F6, F2          LD    F8, -8(R2)
    LD    F4, 0(R2)           MULTD F0, F0, F2
    LD    F8, -8(R2)          MULTD F6, F6, F2
    ADDD  F0, F0, F4          SUBI  R2, R2, 16
    ADDD  F6, F6, F8          SUBI  R1, R1, 16
    SUBI  R2, R2, 16          ADDD  F0, F0, F4
    SUBI  R1, R1, 16          ADDD  F6, F6, F8
    SD    8(R2), F0           SD    8(R2), F0
    BNEZ  R1, LOOP            SD    0(R2), F6
    SD    0(R2), F6

Slide 17: Features of Our Decoupled System
- No execution pipeline bubbles due to cache misses
- Overlapped execution of threads
- Opportunities for better data placement and prefetching
- Fine-grained threads: a limitation?
- Multiple hardware contexts add to hardware complexity
If 35% of instructions are memory access instructions, PL/PS can achieve a 35% performance benefit, given sufficient thread parallelism, and completely mask memory access delays!
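
One way to read the 35% figure is as a back-of-the-envelope bound. The sketch below assumes one instruction per cycle, no stalls, and enough threads that the memory-access work overlaps completely with execution; these assumptions, and the function name `decoupled_cycles`, are not from the slides.

```python
def decoupled_cycles(total_instructions, memory_fraction):
    """Cycles needed when memory instructions run on a separate processor.

    Assumes one instruction per cycle on each processor and perfect overlap,
    so the busier of the two processors determines the total time.
    """
    memory_instr = round(total_instructions * memory_fraction)
    execute_instr = total_instructions - memory_instr
    return max(memory_instr, execute_instr)


baseline = 1000                              # all instructions on one pipeline
decoupled = decoupled_cycles(1000, 0.35)     # 650 execute, 350 memory access
reduction = (baseline - decoupled) / baseline
speedup = baseline / decoupled
```

Under these idealized assumptions the execution pipeline handles only the remaining 65% of the instructions, so the cycle count drops by 35%, which corresponds to a speedup of about 1.54.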

Slide 18: Conditional Statements in SDF

Pre-Load:
    LOAD   RFP|2, R2               / load X into R2
    LOAD   RFP|3, R3               / load Y into R3
                                   / frame pointers for returning results
                                   / frame offsets for returning results

Execute:
    EQ     RR2, R4                 / compare R2 and R3, result in R4
    NOT    R4, R5                  / complement of R4 in R5
    FALLOC "Then_Thread"           / create Then thread (allocate frame memory, set Synch-Count,
    FALLOC "Else_Thread"           / create Else thread (allocate frame memory, set Synch-Count,
    FORKSP R4, "Then_Store"        / if X=Y, get ready post-store "Then_Thread"
    FORKSP R5, "Else_Store"        / else, get ready pre-store "Else_Thread"
    STOP

In Then_Thread, we de-allocate (FFREE) the Else_Thread, and vice versa.
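
The control pattern on this slide can be sketched in Python under one assumption: that `FORKSP Rn, "T"` schedules thread T only when register Rn holds true. That reading comes from the slide's comments, not from an instruction-set reference, and the helper names below are hypothetical.

```python
def run_conditional(x, y):
    """Allocate both branch threads, run one, and free the other."""
    frames = set()   # names of the frames currently allocated
    trace = []

    def falloc(name):            # stand-in for FALLOC
        frames.add(name)
        trace.append(f"FALLOC {name}")

    def ffree(name):             # stand-in for FFREE
        frames.remove(name)
        trace.append(f"FFREE {name}")

    r4 = (x == y)                # EQ: compare X and Y
    r5 = not r4                  # NOT: complement of R4

    falloc("Then_Thread")
    falloc("Else_Thread")

    # Assumed FORKSP behaviour: a thread is scheduled only if its flag is true.
    forks = [("Then_Thread", r4), ("Else_Thread", r5)]
    chosen = next(name for name, flag in forks if flag)
    other = next(name for name, flag in forks if not flag)

    trace.append(f"RUN {chosen}")
    ffree(other)                 # the thread that runs frees the one not taken
    return chosen, frames, trace


chosen, frames, trace = run_conditional(3, 3)
```

Exactly one of the two flags is true, so exactly one thread runs, and the frame of the untaken branch is released rather than left allocated.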

Slide 19: SDF Architecture
[Figure: block diagram of the SDF architecture, showing the Execute Processor (EP), the Memory Access Pipeline, and the Synchronization Processor (SP).]

Slide 20: Execution of SDF Programs
[Figure: timeline of Threads 0 to 4 passing through Preload, Execute, and Poststore phases. The SP handles preload and poststore (SP = PL/PS); the EP handles execution (EP = EX).]
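
The overlap between the two processors can be illustrated with a small cycle-by-cycle simulation. It is a simplified sketch, assuming one SP that serves preloads and poststores one at a time, one EP, and a policy that gives pending poststores priority; the function name `simulate` and the cycle counts are made up for the example.

```python
def simulate(threads):
    """Return total cycles; threads is a list of (preload, execute, poststore)
    cycle counts, each at least 1."""
    to_preload = list(range(len(threads)))  # waiting for the SP (preload)
    to_execute = []                         # preloaded, waiting for the EP
    to_poststore = []                       # executed, waiting for the SP
    sp = None                               # [thread, phase, remaining cycles]
    ep = None                               # [thread, remaining cycles]
    done = 0
    cycle = 0
    while done < len(threads):
        if sp is None:                      # SP picks its next job
            if to_poststore:
                t = to_poststore.pop(0)
                sp = [t, "PS", threads[t][2]]
            elif to_preload:
                t = to_preload.pop(0)
                sp = [t, "PL", threads[t][0]]
        if ep is None and to_execute:       # EP picks a preloaded thread
            t = to_execute.pop(0)
            ep = [t, threads[t][1]]
        cycle += 1                          # both processors advance one cycle
        if sp is not None:
            sp[2] -= 1
            if sp[2] == 0:
                if sp[1] == "PL":
                    to_execute.append(sp[0])
                else:
                    done += 1
                sp = None
        if ep is not None:
            ep[1] -= 1
            if ep[1] == 0:
                to_poststore.append(ep[0])
                ep = None
    return cycle


threads = [(2, 3, 1)] * 4                   # hypothetical cycle counts
serial = sum(pl + ex + ps for pl, ex, ps in threads)
overlapped = simulate(threads)
```

With these numbers the serial total is 24 cycles, while the overlapped schedule finishes in 15, because the SP preloads later threads while the EP is still executing earlier ones.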

Slide 21: Some Performance Results
[Figures: scalability of SDF on three benchmarks (Matrix, FFT, Zoom). Each plot shows execution cycles against the number of functional units for In Order, Out of Order, and SDF.]