SoC CAD 2015/11/22
Instruction Set Extensions for Multi-Threading in LEON3
林孟諭, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

Abstract
• This paper describes instruction set extensions for a variant of multi-threading called micro-threading for the LEON3 SPARCv8 processor.
• It presents the architecture of the developed processor and its key blocks: the cache controller, register file, and thread scheduler.

Introduction
• This paper designs and implements instruction set extensions for the simpler LEON3 SPARCv8 processor, which is suitable for embedded applications.
• The goal has been twofold:
  - to implement in silicon the machine-level instructions that were identified as necessary for efficient micro-threading [5], and thus to get a true picture of how expensive these extensions are in terms of silicon;
  - to achieve an efficient implementation of these extensions that outperforms the original LEON3 processor by better handling memory and execution latencies.

2. Micro-Threading (1/7)
• Micro-threading is a multi-threading variant that decreases the complexity of context management.
• The goal of micro-threading is to tolerate long-latency operations (LD/ST and multi-cycle operations such as floating-point) and to synchronize computation on register accesses.
• In the simplest case the context can be represented by the program counter and by window pointers into the register file.
• Micro-threading has been developed at both the assembly and C levels.
• The basic conceptual unit is a family of threads that share data and implement one piece of a computation.
  - In a simple view, one family corresponds to one for-loop in classical C;
  - in micro-threading, each iteration (each thread) of a hypothetical for-loop (represented by a family of threads) is executed independently, constrained only by data dependencies (see the sketch below).
• A family is synchronized on the termination of all its threads.
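The correspondence between a family and a for-loop can be illustrated with a minimal C sketch. The function names and the vector-addition workload are illustrative assumptions, not taken from the paper; the point is only that each loop iteration becomes an independently scheduled thread of a family.

```c
/* Legacy view: a serial for-loop; one iteration = one unit of work. */
void vector_add_legacy(int n, const int *a, const int *b, int *c) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Micro-threaded view (conceptual): the body below would be instantiated
 * once per index i as a thread of one family; the threads run independently,
 * ordered only by their data dependencies, and the family completes when
 * all of them have ended. */
void vector_add_thread_body(int i, const int *a, const int *b, int *c) {
    c[i] = a[i] + b[i];
}
```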

2. Micro-Threading (2/7)
• A possible speedup generated by micro-threading rests on the assumptions that:
  - while one thread is waiting for its input data, another thread has its input data ready and can be scheduled and executed within a few clock cycles;
  - the thread management logic is simple enough to fit reasonably well in the processor HW.
• The HW requirements of micro-threading are:
  - use of a self-synchronizing register file (i-structures, [9]),
  - register states managed autonomously in the register file,
  - pipeline stalls prevented by a context switch in HW,
  - thread status and context switches managed autonomously in a HW thread scheduler.

2. Micro-Threading (3/7)
• The micro-threading support at the machine level is represented by the following instructions:
  - launch - switches the processor from the legacy mode (user or protected) to the micro-threaded mode.
  - allocate - allocates a family table entry, needed to create a family of threads.
  - setxxxxx - fills in the allocated family table entry with the parameters required by the create instruction.
  - create - creates (a family of) threads based on a family table entry.
  - .registers - a pseudo-instruction that specifies the number of registers needed by a thread.
• Furthermore, each 32-bit instruction word is extended by another two bits that act as an instruction for thread scheduling (a hedged encoding sketch follows the list). Valid combinations are:
  - cont - continue thread execution,
  - swch - switch the context to another thread, e.g. on a memory load, to prevent a possible pipeline stall,
  - end - end thread execution, i.e. the thread ends at this instruction.
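A minimal C sketch of the 2-bit scheduling field is given below. The paper names only the three valid combinations (cont, swch, end); the concrete bit patterns and the type name are assumptions for illustration.

```c
/* Hedged encoding of the per-instruction 2-bit scheduling extension.
 * The numeric values are assumed, not taken from the paper. */
typedef enum {
    HINT_CONT = 0,  /* continue executing the current thread            */
    HINT_SWCH = 1,  /* switch context, e.g. after issuing a memory load */
    HINT_END  = 2   /* this instruction is the last one of the thread   */
} sched_hint_t;
```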

2. Micro-Threading (4/7)
• The format of assembly instructions has been extended by a field delimited by a semicolon. If the field is missing, cont is assumed by default.
    clr %r2
    ld  [%r1 + %g0], %r3 ; swch
    add %r3, %g0, %r4    ; end
• To keep the 32-bit organization of the memory system in SPARCv8, the 2-bit extensions for a group of 15 instructions are packed into one 32-bit word located at the beginning of each cache line. One cache line is formed by 16 words; the first word of each cache line is skipped in the micro-threaded mode (a layout sketch follows).
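The packing described above (one control word per 16-word cache line, carrying 2 bits for each of the 15 following instructions) can be modelled with a small C helper. The bit ordering inside the control word and the function name are assumptions; only the 16-word line with a skipped first word comes from the text.

```c
#include <stdint.h>

/* Hedged layout sketch: line[0] is the control word of a 16-word cache line;
 * line[1..15] are instructions. The hint for the instruction in word k is
 * assumed to sit at bits [2*(k-1) .. 2*(k-1)+1] of the control word. */
static inline unsigned hint_for_word(const uint32_t line[16], unsigned k /* 1..15 */)
{
    return (line[0] >> (2u * (k - 1u))) & 0x3u;
}
```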

2. Micro-Threading (5/7)
Fig. 1. Organization of the instruction cache (16 words = 1 cache line).

2. Micro-Threading (6/7)
• Micro-threading relies on the use of a self-synchronizing register file based on i-structures [9].
• To implement the i-structures, each register has to be extended to contain the state of its value. A register can be (a minimal C model follows):
  - empty - on power-on reset,
  - pending - a memory load operation has been requested and no thread has accessed the register since,
  - waiting - a memory load operation has been requested and a thread has accessed the register since,
  - full - the register contains valid data.
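A minimal C model of such a self-synchronizing register is sketched below. The struct, field, and function names are assumptions; the pending-to-waiting transition on a read of a not-yet-full register follows the state descriptions above.

```c
#include <stdint.h>

typedef enum { REG_EMPTY, REG_PENDING, REG_WAITING, REG_FULL } reg_state_t;

typedef struct {
    reg_state_t state;  /* register state kept alongside the value */
    uint32_t    value;  /* 32-bit register data                    */
} sync_reg_t;

/* Returns 1 and the value when the register is full. Otherwise returns 0,
 * and if a load is already pending, records that a thread is now waiting
 * on this register (the reading thread would be switched out). */
static int read_sync_reg(sync_reg_t *r, uint32_t *out)
{
    if (r->state == REG_FULL) {
        *out = r->value;
        return 1;
    }
    if (r->state == REG_PENDING)
        r->state = REG_WAITING;
    return 0;
}
```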

2. Micro-Threading (7/7)
• A sample program execution is shown in Fig. 3.
• The processor starts in the legacy mode on power-on reset, then it switches to the micro-threaded mode.
• The parent thread gets synchronized with the child threads by reading the register %l2.
• On completion of all micro-threads the processor switches back to the legacy mode.
Fig. 3. Program flow.

3. UTLEON3 Architecture (1/2)
• Fig. 2 shows the architecture of UTLEON3, a LEON3 extended with the instruction set extensions for micro-threading.
• The core is a 32-bit integer pipeline that executes all legacy instructions.
• Thread management is implemented in a thread scheduler, which can be seen as a simple 2-bit processor.
• The instruction word of UTLEON3 is 34 bits wide.
• All registers have been extended by 2 bits that capture the register states, so each register is 34 bits long.

3. UTLEON3 Architecture (2/2)
Fig. 2. Architecture of UTLEON3. RUC - register update controller, RAU - register allocation unit, AU - allocation unit.

3. A. Cache Controllers (1/5)
• Memory accesses are decoupled from the integer pipeline.
• The cache controllers are divided into two parts connected through cache-line fetch request FIFOs (a FIFO sketch follows the list).
  - The pipeline-side cache controllers store fetch requests in the FIFOs.
  - The memory-side cache controllers process the queued requests.
• On completion of an instruction cache line fetch, all threads waiting for the cache line are marked as ready for execution in the scheduler. Cache lines that are used by threads are locked to prevent their eviction and to guarantee forward progress.
• On completion of a data cache line fetch, all registers that have been waiting for the data in the cache line are updated by the register update controller (RUC).
• Cache line fetch scenarios are shown in Figs. 4-7.
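The decoupling can be pictured with a small C sketch of a fetch-request FIFO: the pipeline side only enqueues a line address, and the memory side dequeues and services it later. The depth, field names, and functions are illustrative assumptions.

```c
#include <stdint.h>

#define FIFO_DEPTH 8  /* assumed depth, not specified in the paper */

typedef struct { uint32_t line_addr; } fetch_req_t;

typedef struct {
    fetch_req_t buf[FIFO_DEPTH];
    unsigned head, tail, count;
} fetch_fifo_t;

/* Pipeline-side controller: queue a cache-line fetch and continue. */
static int enqueue_fetch(fetch_fifo_t *f, fetch_req_t r)
{
    if (f->count == FIFO_DEPTH) return 0;   /* FIFO full: back-pressure */
    f->buf[f->tail] = r;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

/* Memory-side controller: take the next queued request to service it. */
static int dequeue_fetch(fetch_fifo_t *f, fetch_req_t *r)
{
    if (f->count == 0) return 0;            /* nothing queued */
    *r = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```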

3. A. Cache Controllers (2/5)
Fig. 4. Data cache hit/miss.

3. A. Cache Controllers (3/5)
Fig. 5. Instruction cache hit/miss. Requests originated in the fetch stage.

3. A. Cache Controllers (4/5)
Fig. 6. Instruction cache hit/miss. Requests originated in the execute stage.

3. A. Cache Controllers (5/5)
Fig. 7. Instruction cache hit/miss. Requests originated in the scheduler.

3. B. Thread Scheduler (1/4)
• The thread scheduler manages the family and thread tables, creates threads, switches contexts and cleans up the tables on thread completion (see Fig. 2).
• Dynamic register allocation is performed on thread creation by the register allocation unit (RAU).
• The family table and the thread table store information on the threads being processed in the processor.
• A context switch can result from an explicit swch or end instruction or an instruction cache miss, or it can occur on reading a register not marked full (a sketch of these conditions follows).
• Threads can be in one of six states; the state transition diagram is shown in Fig. 8.
• Thread creation and context switching are shown in Figs. 9 and 10.
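The three context-switch causes named above can be summarized in a small C predicate. The function and parameter names are assumptions; the six thread states are not named in the text and are therefore not modelled here.

```c
#include <stdbool.h>

typedef enum { HINT_CONT, HINT_SWCH, HINT_END } sched_hint_t;
typedef enum { REG_EMPTY, REG_PENDING, REG_WAITING, REG_FULL } reg_state_t;

/* Decide whether the scheduler switches the current thread out. */
static bool must_switch_context(sched_hint_t hint, bool icache_miss,
                                bool reads_register, reg_state_t src_state)
{
    if (hint == HINT_SWCH || hint == HINT_END) return true;    /* explicit hint   */
    if (icache_miss)                           return true;    /* fetch stalled   */
    if (reads_register && src_state != REG_FULL) return true;  /* operand missing */
    return false;
}
```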

3. B. Thread Scheduler (2/4)
Fig. 8. Transitions between thread states.

3. B. Thread Scheduler (3/4)
Fig. 9. Thread creation in the scheduler.

3. B. Thread Scheduler (4/4)
Fig. 10. Context switch in the scheduler.

Simple Benchmark Program
• Fig. 11 depicts an actual implementation of the program: the left part shows the legacy version using a for-loop, and the right side shows the micro-threaded version.
• The micro-threaded version creates one family of threads (marked F1 in the picture) that corresponds to the classical for-loop.
Fig. 11. The benchmark program; legacy code and micro-threaded code.

Conclusion
• This paper has described an initial implementation of instruction set extensions for micro-threading in SPARC.
• The architecture of the key functional blocks of the UTLEON3 processor has been presented, together with implementation data for the Xilinx XC2VP30 FPGA.