Helper Threads via Virtual Multithreading on an Experimental Itanium 2 Processor Platform. Perry H. Wang et al.

Outline: Helper threads. VMT ideas. Implementation details (hardware, firmware, compiler). Results. Conclusion.

Helper threads: Used in multithreaded architectures to prefetch hard-to-predict delinquent data or to precompute hard-to-predict branches. Threads share resources such as fetch bandwidth and functional units.

Hyper-Threading (Intel P4): Each hardware thread context is exposed to the OS as a logical processor. The OS finds threads for execution and binds them to the logical processors. The user has to use an OS-visible thread API to create and manage threads.

Helper Threads - Issues: Resource contention among multiple helper threads. Invocation has to adapt to different program phases. Threads have to be self-throttling. OS-based thread synchronization is unpredictable and has long latency (on the order of microseconds).

Virtual Multithreading: A single processor supports multiple thread contexts. Monitors long-latency microarchitectural events. Switches to a different instruction in the same program in 100 cycles. OS-transparent. Uses firmware support in the Itanium 2 processor to reduce context switch time.

Context switch requires

Advantages with VMT on Itanium: Ability to track microarchitectural events (e.g. last-level cache misses) without involving the OS. Large register set, partitioned by the compiler for helper threads. Register communication is easier. Value synchronization needs no memory communication. OS context switches allow threads to be resumed on any processor.

New Instructions: Yield -> synchronous transfer to the VMT thread, similar to a branch misprediction. Yield conditional -> transfer only when the pipeline stalls at some later instruction. If execution proceeds and instructions retire (no pipeline stall), the instruction behaves as a nop.
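
The difference between the two instructions can be sketched as a toy Python model. The function names and the `stall_observed` flag are hypothetical stand-ins chosen for illustration; they are not part of the Itanium ISA or of the authors' implementation.

```python
def yield_unconditional():
    """Model of yield: control always transfers to the VMT helper thread."""
    return "switch_to_helper"


def yield_conditional(stall_observed):
    """Model of yield conditional.

    stall_observed is a hypothetical flag: True if the pipeline stalls at
    some later instruction, False if execution proceeds and instructions
    keep retiring.
    """
    if stall_observed:
        return "switch_to_helper"
    # No stall: the instruction has no effect.
    return "nop"
```

In this model a conditional yield costs nothing on the fast path, which is presumably why it can be placed ahead of loads that only sometimes miss.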

Key Characteristics: Self-throttling: main and helper threads keep iteration counters to track progress. Helper thread falls behind -> reload value. Helper thread runs too far ahead -> relinquish control. The main thread resumes execution at the instruction that triggered the helper thread invocation. VMT preserves the continuation of helper threads -> a helper thread can restart where it stopped.
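
A minimal Python sketch of the self-throttling decision, assuming both threads expose their iteration counters. The function name, the `max_lead` bound, and the returned action labels are hypothetical; the slides do not say how far ahead a helper is allowed to run.

```python
def throttle_action(main_iter, helper_iter, max_lead):
    """Decide what the helper thread should do next.

    main_iter / helper_iter: iteration counters kept by each thread.
    max_lead: hypothetical bound on how far ahead the helper may run.
    """
    if helper_iter < main_iter:
        # Helper has fallen behind: reload its value from the main thread.
        return "reload"
    if helper_iter - main_iter > max_lead:
        # Helper is too far ahead: relinquish control to the main thread.
        return "relinquish"
    return "continue"
```

For example, with `max_lead=4`, a helper at iteration 8 while the main thread is at 10 would reload, and a helper at iteration 20 would relinquish control.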

Key Characteristics: VMT has to maintain the initial instruction address and the continuation instruction address. The compiler reserves 2 registers for this purpose. Multiple helper threads can be supported by reprogramming these registers.
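
The two-address bookkeeping can be illustrated with a small Python sketch. `HelperContext`, `select_helper`, and the dictionary standing in for the two reserved registers are all hypothetical names introduced here, not the actual firmware data structures.

```python
from dataclasses import dataclass


@dataclass
class HelperContext:
    initial_ip: int        # address where the helper thread starts
    continuation_ip: int   # address where the helper thread resumes

    def suspend(self, current_ip):
        """Record where the helper stopped so it can restart there."""
        self.continuation_ip = current_ip

    def restart(self):
        """Discard progress and resume from the initial address."""
        self.continuation_ip = self.initial_ip


def select_helper(registers, ctx):
    """Reprogram the two reserved registers to point at another helper."""
    registers["initial"] = ctx.initial_ip
    registers["continuation"] = ctx.continuation_ip
    return registers
```

Switching between several helper threads then amounts to calling `select_helper` with a different context, which mirrors the idea of reprogramming the same two registers.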

Itanium Firmware: Programmable debugging hardware support for PAL, to enable silicon debugging and validation. PAL can program the PMU to monitor and count events of interest: opcodes, instruction addresses, data addresses. The debugging hardware can trigger a PAL handler when the monitored event occurs.
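
The monitor-then-trigger pattern described above can be sketched in Python. `EventMonitor` and the event names are hypothetical; this is a toy model of the control flow, not the PAL or PMU programming interface.

```python
class EventMonitor:
    """Toy stand-in for a PMU programmed to watch events of interest."""

    def __init__(self):
        self.counts = {}
        self.handlers = {}

    def program(self, event, handler):
        """Start counting `event` and call `handler` whenever it occurs."""
        self.counts[event] = 0
        self.handlers[event] = handler

    def observe(self, event):
        """Report an event; unmonitored events are ignored."""
        if event not in self.handlers:
            return None
        self.counts[event] += 1
        return self.handlers[event](event)
```

A handler registered for a hypothetical `"l3_miss"` event would run on every such miss, while events that were never programmed pass through untouched.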

Firmware: The VMT mechanism is emulated by the firmware infrastructure. Opcode monitoring simulates yield and yield conditional. PAL programs the PMU to track last-level cache misses, pipeline stalls, and instructions with special opcodes. Thread switch latency = pipeline flush + overhead for manipulating registers (~140 cycles, leaving ~60 cycles of computation time when a memory miss takes ~200 cycles).
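
The cycle budget quoted on this slide is simple subtraction, shown here as a small Python helper. The function name is hypothetical, and the model assumes the whole switch overhead is paid inside the miss window.

```python
def helper_compute_window(miss_latency, switch_overhead):
    """Cycles left for helper work while a miss is outstanding."""
    return max(0, miss_latency - switch_overhead)
```

With the slide's figures, `helper_compute_window(200, 140)` gives 60 cycles. If the miss latency were shorter than the switch overhead, the window would be zero and switching would not pay off.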

Experimental machine: 4-way 1.5 GHz Itanium 2 processor-based MP system with 16 GB of RAM. Separate 16 KB 4-way set-associative L1 I- and D-caches. Shared 256 KB 8-way set-associative L2 cache. 6 MB 24-way L3 cache that can be configured as a 1 MB 4-way set-associative cache.
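
As a quick check on the two L3 configurations, the Python snippet below computes the capacity per way for each. The helper name is hypothetical. The observation that both come out equal is an inference from the numbers, not something the slides state; it would be consistent with the smaller configuration using a subset of the ways.

```python
KB = 1024
MB = 1024 * KB


def bytes_per_way(capacity_bytes, ways):
    """Capacity of a single way of a set-associative cache."""
    return capacity_bytes // ways


full = bytes_per_way(6 * MB, 24)     # 6 MB, 24-way configuration
reduced = bytes_per_way(1 * MB, 4)   # 1 MB, 4-way configuration
```

Both values are 256 KB, so the number of sets is the same in either configuration regardless of the line size.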

Workloads: MCF – combinatorial optimization. VPR – FPGA circuit placement and routing. DOT – graph layout optimization tool. DSS system running on a 100 GB IBM DB2 database: 6 queries that have long run times and span large portions of the database; 95% CPU utilization, 40 concurrent threads.

Compiler and optimizations: Electron compiler with -O3, IPA, profile-guided optimization, and Itanium 2-specific optimizations. Recompiled to obtain the helper threads, which are linked with the original binaries. Register partitioning to minimize VMT context switch overhead. Aggressive software prefetching with profile feedback. ld.s, chk, predication, branch prediction hints.

Speedup: A few helper threads give good speedup. A significant fraction of L3 misses is removed from the main thread (avg. 48%). Capacity misses in L3 are due to pointer chasing. Helper threads are small. Helper threads can also contain control-flow dependencies. Throughput is improved by reducing the latency of individual threads.

Conclusion: Fly-weight context switching. 5.8–38.5% performance increase for SPEC2000 INT. 5–12% speedup on the DSS workload. VMT threads are to be invoked based on program behavior, depending on the number of cache misses.

My view on limitations: Requires large register files and firmware support. Too Itanium-specific (not adaptable to other architectures). Scalability of helper threads (# helper threads running at one time… too complex).