EECC722 - Shaaban #1 Lec # 4 Fall 2002 9-18-2002 Operating System Impact on SMT Architecture The work published in “An Analysis of Operating System Behavior.


EECC722 - Shaaban #1 Lec # 4 Fall Operating System Impact on SMT Architecture
The work published in “An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture” (Josh Redstone et al., in Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, November) represents the first study of OS execution on a simulated SMT processor.
The SimOS environment was adapted for SMT:
  – Alpha-based SMT CPU core added.
  – Digital Unix 4.0d modified to support SMT.
Study goals:
  – Compare SMT/OS performance results with previous SMT performance results that do not account for OS behavior and impact.
  – Contrast OS impact between OS-intensive and non-OS-intensive workloads.
Two types of workloads were selected for the study:
  – Non-OS-intensive workload: a multiprogrammed mix of 8 SPECInt95 benchmarks.
  – OS-intensive workload: multi-threaded Apache web server (64 server processes), driven by the SPECWeb benchmark (128 clients).
No SMT-specific OS optimizations were investigated in this study.

EECC722 - Shaaban #2 Lec # 4 Fall OS Code Vs. User Code
Operating systems are usually huge programs that can overwhelm the cache and TLB due to code and data size.
Operating systems may impact branch prediction performance because of frequent branches and infrequent loops.
OS execution is often brief and intermittent, invoked by interrupts, exceptions, or system calls, and can cause the replacement of useful cache, TLB and branch prediction state for little or no benefit.
The OS may perform spin-waiting, explicit cache/TLB invalidation, and other operations not common in user-mode code.

EECC722 - Shaaban #3 Lec # 4 Fall SimOS
SimOS is a complete machine simulation environment developed at Stanford.
Designed for the efficient and accurate study of both uniprocessor and multiprocessor computer systems.
Simulates computer hardware in enough detail to boot and run commercial operating systems.
SimOS currently provides CPU models of the MIPS R4000 and R10000 and Digital Alpha processor families.
In addition to the CPU, SimOS also models caches, multiprocessor memory buses, disk drives, Ethernet, consoles, and other system devices.
SimOS has been ported for IRIX versions 5.3 (32-bit) and 6.4 (64-bit) and Digital UNIX; a port of Linux for the Alpha is being developed.

EECC722 - Shaaban #4 Lec # 4 Fall SimOS System Diagram

EECC722 - Shaaban #5 Lec # 4 Fall A Base SMT hardware Architecture. Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages

EECC722 - Shaaban #6 Lec # 4 Fall Alpha-based SMT Processor Parameters
Duplicate the register file, program counter, subroutine stack and internal processor registers of a superscalar CPU to hold the state of multiple threads.
Add per-context mechanisms for pipeline flushing, instruction retirement, subroutine return prediction, and trapping.
Fetch unit, functional units, data L1, L2, and TLB are shared among contexts.
~10% chip-area increase over superscalar (compared to ~5% for Intel’s hyper-threaded Xeon).
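The split between replicated and shared state on this slide can be pictured with a small Python sketch. The class and field names, the 32-entry register list, and the choice of 8 contexts are illustrative assumptions; they are not taken from the simulator or the paper.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareContext:
    """State replicated once per thread (hypothetical field names)."""
    program_counter: int = 0
    registers: list = field(default_factory=lambda: [0] * 32)  # size is an assumption
    return_stack: list = field(default_factory=list)
    internal_registers: dict = field(default_factory=dict)


@dataclass
class SMTCore:
    """One physical core: a list of per-thread contexts plus shared structures."""
    contexts: list
    # Shared structures are represented only by name in this sketch.
    shared: tuple = ("fetch unit", "functional units", "data L1", "L2", "TLB")


def make_core(num_contexts):
    # Every context gets its own private copy of the replicated state.
    return SMTCore(contexts=[HardwareContext() for _ in range(num_contexts)])


core = make_core(8)  # 8 contexts chosen only for illustration
core.contexts[0].program_counter = 0x1000  # does not affect any other context
```

Writing to one context's program counter or registers leaves the others untouched, while every context refers to the same set of shared structures, which is where inter-thread interference can arise.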

EECC722 - Shaaban #7 Lec # 4 Fall OS Modifications for SMT
Only the minimal OS modifications required to support SMT are considered (no OS optimizations for SMT are considered here):
The OS task scheduler must support multiple threads in running status:
  – A shared-memory multiprocessor (SMP) aware OS (including Digital Unix) has this ability, but each thread runs on a different CPU in SMP systems.
  – An SMT processor appears to such an OS as multiple shared-memory CPUs (logical processors).
TLB-related code must be modified:
  – Mutual exclusion support for simultaneous access by multiple threads to the address space number (ASN) tags of the TLB.
  – Modified ASN assignment to account for the presence of multiple threads.
  – Internal CPU registers used to modify TLB entries are replicated per context.
No OS changes are required to account for the shared L1 cache of SMT vs. the non-shared L1 caches of SMP.
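To illustrate why ASN bookkeeping needs mutual exclusion once several contexts share one TLB, here is a minimal Python sketch of a lock-protected ASN allocator. It is not the Digital Unix code; `ASNAllocator`, `asn_for`, `assign_concurrently`, and the wraparound policy are hypothetical names and choices made for this example.

```python
import threading


class ASNAllocator:
    """Hands out address space numbers (ASNs) for a TLB shared by all contexts."""

    def __init__(self, num_asns):
        self._lock = threading.Lock()   # serializes access to the shared ASN state
        self._num_asns = num_asns
        self._next_asn = 0
        self._asn_of = {}               # pid -> ASN currently assigned
        self.generation = 0             # bumped each time the ASN space is recycled

    def asn_for(self, pid):
        with self._lock:
            if pid in self._asn_of:
                return self._asn_of[pid]
            if self._next_asn == self._num_asns:
                # Out of tags: recycle the whole ASN space. A real kernel would
                # presumably also have to invalidate stale TLB entries here.
                self._asn_of.clear()
                self._next_asn = 0
                self.generation += 1
            asn = self._next_asn
            self._next_asn += 1
            self._asn_of[pid] = asn
            return asn


def assign_concurrently(allocator, pids):
    """Requests an ASN for each pid from its own thread, mimicking several contexts."""
    results = {}

    def worker(pid):
        results[pid] = allocator.asn_for(pid)

    threads = [threading.Thread(target=worker, args=(pid,)) for pid in pids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the lock, two contexts could read the same `_next_asn` value and tag two different address spaces with the same ASN, which is the kind of race the slide's mutual-exclusion requirement guards against.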

EECC722 - Shaaban #8 Lec # 4 Fall SPECInt Workload Execution Cycle Breakdown
Percentage of execution cycles for OS kernel instructions:
  – During program startup: 18%, mostly due to data TLB misses and to a lesser extent system calls.
  – Steady state: 5%, still dominated by TLB misses.

EECC722 - Shaaban #9 Lec # 4 Fall Breakdown of Kernel Time for SPECInt95
Steady state: 5%, dominated by TLB misses.
Program startup: 18%, mostly due to data TLB misses and system calls.

EECC722 - Shaaban #10 Lec # 4 Fall SPEC System Calls Percentage System calls as a percentage of total execution cycles.

EECC722 - Shaaban #11 Lec # 4 Fall SPECInt95 Dynamic Instruction Mix
Percentage of dynamic instructions in the SPECInt workload by instruction type. The percentages in parentheses for memory operations represent the proportion of loads and stores that are to physical addresses. A percentage breakdown of branch instructions is also included. For conditional branches, the number in parentheses represents the percentage of conditional branches that are taken.

EECC722 - Shaaban #12 Lec # 4 Fall SPECInt95 Total Miss Rates & Distribution of Misses
The miss categories are percentages of all user and kernel misses. Bold entries signify kernel-induced interference. User-kernel conflicts are misses in which the user thread conflicted with some type of kernel activity (the kernel executing on behalf of this user thread, some other user thread, a kernel thread, or an interrupt).

EECC722 - Shaaban #13 Lec # 4 Fall Metrics for SPECInt95 with and without the Operating System for both SMT and Superscalar
The maximum issue rate for integer programs is 6 instructions per cycle on the 8-wide SMT, because there are only 6 integer units.

EECC722 - Shaaban #14 Lec # 4 Fall Apache Workload Execution Cycle Breakdown
Apache has almost no start-up period, since its ‘start-up’ consists simply of receiving the first incoming requests and waking up the server threads.
Once requests arrive, Apache spends over 75% of its time in the OS.

EECC722 - Shaaban #15 Lec # 4 Fall Breakdown of kernel time for Apache vs. SPECInt95 on SMT

EECC722 - Shaaban #16 Lec # 4 Fall Apache System Calls By Name

EECC722 - Shaaban #17 Lec # 4 Fall Apache System Calls By Function

EECC722 - Shaaban #18 Lec # 4 Fall Apache Dynamic Instruction Mix
The percentages in parentheses for memory operations represent the proportion of loads and stores that are to physical addresses. A percentage breakdown of branch instructions is also included. For conditional branches, the number in parentheses represents the percentage of conditional branches that are taken.

EECC722 - Shaaban #19 Lec # 4 Fall Metrics for SMT SPEC, Apache & Superscalar Apache
All applications are executing with the operating system.

EECC722 - Shaaban #20 Lec # 4 Fall Apache+OS Total Miss Rates & Distribution of Misses
The miss categories are percentages of all user and kernel misses. Bold entries signify kernel-induced interference. User-kernel conflicts are misses in which the user thread conflicted with some type of kernel activity (the kernel executing on behalf of this user thread, some other user thread, a kernel thread, or an interrupt).

EECC722 - Shaaban #21 Lec # 4 Fall Percentage of Misses Avoided Due to Interthread Cooperation on Apache
Shown by execution mode. The number in a table entry shows the percentage of overall misses for the given resource that threads executing in the mode indicated in the leftmost column would have encountered, if not for prefetching by other threads executing in the mode shown at the top of the column.
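One possible reading of how each table entry is derived can be sketched in Python. This assumes the denominator is the total number of misses actually observed for the resource, which the slide does not spell out; the function name and the counts are hypothetical and are not results from the study.

```python
def percent_misses_avoided(avoided_counts, total_misses):
    """Converts raw avoided-miss counts into table percentages.

    avoided_counts maps (beneficiary_mode, prefetcher_mode) to the number of
    misses the beneficiary avoided thanks to the prefetcher. total_misses is
    the overall miss count for the resource (assumed denominator).
    """
    return {
        modes: 100.0 * count / total_misses
        for modes, count in avoided_counts.items()
    }


# Hypothetical counts for one resource, chosen only to show the arithmetic.
example_counts = {
    ("user", "user"): 10,
    ("user", "kernel"): 5,
    ("kernel", "user"): 5,
    ("kernel", "kernel"): 30,
}
table = percent_misses_avoided(example_counts, total_misses=200)
```

The first element of each key corresponds to the row (the mode that benefits) and the second to the column (the mode whose earlier access acted as a prefetch).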

EECC722 - Shaaban #22 Lec # 4 Fall OS Impact on Hardware Structures Performance

EECC722 - Shaaban #23 Lec # 4 Fall OS Impact on SMT Study Summary
Results show that for SMT, omission of the operating system did not lead to a serious misprediction of performance for SPECInt, although the effects were more significant for a superscalar executing the same workload.
On the Apache workload, however, the operating system is responsible for the majority of instructions executed:
  – Apache spends a significant amount of time responding to system service calls in the file system and kernel networking code.
  – The result of the heavy execution of OS code is an increase of pressure on various low-level resources, including the caches and the BTB.
  – Kernel threads also cause more conflicts in those resources, both with other kernel threads and with user threads; on the other hand, there is a positive interthread sharing effect as well.

EECC722 - Shaaban #24 Lec # 4 Fall Possible SMT-specific OS Optimizations
Smart SMT-optimized OS task scheduler for better SMT-core performance:
  – Schedule cooperating threads that benefit from SMT’s resource and data sharing to run simultaneously.
  – To aid SMT’s latency hiding, avoid scheduling too many threads that conflict over the same CPU resource (TLB, cache, FP, etc.).
  – For an SMP-SMT system, tightly coupled threads should be scheduled to logical processors in the same physical SMT CPU (processor affinity).
Introduce a lightweight dedicated kernel context, cached in the SMT core, to handle process management and speed up system calls.
Prevent the “idle loop” thread from consuming execution resources:
  – Intel Hyper-threading solution: use the HALT instruction.
Allow thread caching in the CPU to further reduce context-switching overheads.
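The first two scheduler ideas can be sketched as a simple greedy selection in Python. This is an assumed heuristic for illustration only, not a scheduler from the study; `Thread`, `pick_coschedule`, the `group` and `hot_resource` fields, and the `max_per_resource` cap are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    tid: int
    group: int          # threads with the same group id cooperate (share data)
    hot_resource: str   # resource the thread stresses most, e.g. "fp", "cache", "tlb"


def pick_coschedule(ready, num_contexts, max_per_resource=2):
    """Chooses up to num_contexts ready threads to run simultaneously on one SMT core."""
    # Prefer threads from the largest cooperating groups so they run together.
    group_size = Counter(t.group for t in ready)
    ordered = sorted(ready, key=lambda t: (-group_size[t.group], t.group, t.tid))

    chosen = []
    load = Counter()  # how many chosen threads stress each resource
    for t in ordered:
        if len(chosen) == num_contexts:
            break
        if load[t.hot_resource] >= max_per_resource:
            continue  # too many chosen threads already contend for this resource
        chosen.append(t)
        load[t.hot_resource] += 1
    return chosen


ready = [
    Thread(1, 0, "fp"),
    Thread(2, 0, "fp"),
    Thread(3, 0, "fp"),
    Thread(4, 1, "cache"),
    Thread(5, 2, "tlb"),
]
selected = pick_coschedule(ready, num_contexts=4)
```

In this example the third FP-heavy thread is skipped in favor of threads that stress other resources. A real scheduler would also need fairness and a fallback so that the cap never leaves hardware contexts idle while runnable threads wait.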