Two Threads Are Better Than One

Two Threads Are Better Than One Craig Hodgins zSeries Performance Engineer Royal Bank of Canada

What is SMT2?
• SMT2 is Simultaneous Multithreading x2
• A CPU is now called a core
• An instruction stream is now called a thread
• Allows 2 threads to execute on one zIIP core

Why SMT2?
• Processor speeds are approaching the physical limits
• SMT2 is an attempt to use parallelism to increase capacity

• One thread per core: faster execution but lower throughput
• Two threads per core (SMT2): slower execution but higher throughput

SMT2 Requirements
• Enabled
• Turned ON

Rollout Methodology
• New system measurement metrics may affect performance tools, capacity planning, and chargeback reporting (for example RMF, MXG, TDS)
• It is desirable to detect and assess any measurement impacts as early as possible, on test systems, before rolling out to production [sysprog/dev/test/prod]

  APAR Identifier ...... OA47662
  Last Changed ........ 15/08/07
  PROBLEM DESCRIPTION: RMF Monitor III PROC and PROCU reports: Lost of precision for APPL% and EAPPL% fields when running in PROCVIEW CORE mode and MT_1 mode only.

Rollout Methodology
• Enabling at least one LPAR per production sysplex, with different characteristics and workload mix, would be useful
• In other words, don't do the whole sysplex at one time
• I created a spreadsheet to track the project

SMT2 Verification
• Review messages after SET OPT=xx
• Review SDSF
• Review RMF

Messages After SET OPT=xx
  00:27:05 E SET OPT=MH
  00:27:05 E IEE252I MEMBER IEAOPTMH FOUND IN SYS1.PARMLIB
  00:27:05 E IEE536I OPT VALUE MH NOW IN EFFECT
  00:27:06 E IWM066I MT MODE CHANGED FOR PROCESSOR CLASS zIIP. THE MT MODE WAS CHANGED FROM 1 TO 2.
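The message check above could be automated once the console log has been saved as text. The following is a minimal Python sketch, not an IBM-supplied tool: the function name `find_mt_mode_changes` is hypothetical, and the pattern is built only from the IWM066I text shown on this slide, so the layout on other systems or releases is an assumption.

```python
import re

# Pattern derived from the IWM066I text on the slide; the exact message
# layout elsewhere is an assumption.
MT_CHANGE = re.compile(
    r"IWM066I MT MODE CHANGED FOR PROCESSOR CLASS (?P<cls>\w+)\.\s+"
    r"THE MT MODE WAS CHANGED FROM (?P<old>\d+) TO (?P<new>\d+)"
)


def find_mt_mode_changes(log_text):
    """Return (processor_class, old_mode, new_mode) for each IWM066I message."""
    return [
        (m.group("cls"), int(m.group("old")), int(m.group("new")))
        for m in MT_CHANGE.finditer(log_text)
    ]


# The four console lines from the slide.
sample_log = """\
00:27:05 E SET OPT=MH
00:27:05 E IEE252I MEMBER IEAOPTMH FOUND IN SYS1.PARMLIB
00:27:05 E IEE536I OPT VALUE MH NOW IN EFFECT
00:27:06 E IWM066I MT MODE CHANGED FOR PROCESSOR CLASS zIIP. THE MT MODE WAS CHANGED FROM 1 TO 2.
"""

for cls, old, new in find_mt_mode_changes(sample_log):
    print(f"{cls}: MT mode {old} -> {new}")
```

An empty result after a SET OPT=xx would suggest the MT mode did not change and the SDSF and RMF checks deserve a closer look.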

SDSF D M=CPU

RMF CPC Report

New Metrics
• MT-2 Max CF (Capacity Factor) is the ratio of the maximum amount of work that can be accomplished using 2 threads to the amount of work that would have been accomplished with 1 thread
• MT-2 Max CF is workload dependent (the max value is 2 and IBM expects average values of about 1.4)
• The MT-2 CF is the ratio of the maximum amount of work that has been accomplished using 1 or 2 threads to the amount of work that would have been accomplished with multithreading turned off
• The Average Thread Density shows the average number of threads that have been simultaneously active in the measured interval
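To make the definitions above concrete, here is a small Python sketch of the two ratios. It only illustrates the arithmetic; it is not how RMF derives the values from hardware measurements. The function names, the work units, and the sample inputs are all hypothetical, and the thread-density samples are assumed to be taken only while the core is busy.

```python
def capacity_factor(work_with_mt, work_without_mt):
    """Ratio of work accomplished with multithreading on to the work that
    would have been accomplished with multithreading off."""
    if work_without_mt <= 0:
        raise ValueError("work_without_mt must be positive")
    return work_with_mt / work_without_mt


def average_thread_density(active_thread_samples):
    """Mean number of simultaneously active threads over the samples
    (assumed here to be taken only while the core is busy)."""
    if not active_thread_samples:
        raise ValueError("at least one sample is required")
    return sum(active_thread_samples) / len(active_thread_samples)


# Hypothetical numbers, chosen only to show the calculation.
cf = capacity_factor(work_with_mt=130, work_without_mt=100)
density = average_thread_density([1, 2, 2, 1, 2])
print(f"MT-2 CF = {cf:.2f}, average thread density = {density:.2f}")
```

By definition the capacity factor for SMT2 cannot exceed 2, and the average thread density lies between 1 and 2 under the busy-only sampling assumption.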

SMT2 Benefits
• SMT delivers more throughput per core, therefore more capacity
• Less power and cooling required per unit of capacity
• But an individual SMT2 thread is slower than a single thread would be (we'll see why in a minute)
• If an SMT2 core provides 140% of the capacity of a single thread, then two threads will (on average) each run at 70% of the single-thread speed when both threads are active
• Increased sharing of low-level resources by threads makes the amount of work that a thread can do dependent on what else the core is doing
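The 140% / 70% relationship above is simply the core's capacity factor divided by the number of active threads. A minimal Python sketch, with a hypothetical function name and assuming the capacity is split evenly between the threads (in practice the split depends on the workload):

```python
def per_thread_speed(core_capacity_factor, active_threads=2):
    """Average speed of each thread relative to a single thread running
    alone, assuming the core's capacity is shared evenly."""
    if active_threads < 1:
        raise ValueError("active_threads must be at least 1")
    return core_capacity_factor / active_threads


# The example from the slide: 140% core capacity shared by two threads.
print(f"{per_thread_speed(1.4):.0%} of single-thread speed per thread")

# The theoretical maximum capacity factor of 2 would mean no slowdown.
print(f"{per_thread_speed(2.0):.0%} of single-thread speed per thread")
```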

What Causes the Slowdown?
• A major cause is the sharing of processor cache
• On recent System z processors, there are two levels of cache that are private to each core (L1 and L2)
• If a core has more than one thread, these caches will be shared across both threads
• Each thread is forced to get by with a smaller footprint in these caches and so incurs more L1 and L2 misses than if the caches were not shared
• Other resources must also be shared:
  • The execution pipes
  • The translation lookaside buffer (TLB)
  • Physical General Purpose Registers
  • Store Buffers and other resources on the core

What to Expect
• Actual throughput for SMT2 can range from less than 100% to close to 200%, depending upon the usage of the shared resources
• If programs running on the same core utilize the same resources (competing), they will run slower than before
• If programs use different resources (complementary), they can run close to the ideal maximum speed
• Running the same application multiple times shows less repeatable CPU usage because it may run in differing environments

What Did RBC See?
Using 3 LPARs as a sample:
• There was no noticeable response time or task delay impact with "slower" SMT2 zIIP threads
• There was no zIIP CPU consumption or chargeback volume change
• We realized an approximately 10% reduction in relative physical zIIP utilization on a large LPAR, but only a 3% reduction on smaller LPARs
• The overall weighted zIIP capacity utilization benefit from SMT2 across all large and small LPARs was about 8% (compare to IBM's claim of an expected 25%-40% zIIP capacity benefit from SMT2)
• No major issues (23 LPARs converted with 17 left to go)
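The "overall weighted" figure above is a capacity-weighted average of the per-LPAR reductions. The Python sketch below shows one way such a figure could be computed. The slides do not give the LPAR sizes, so the weights used here are purely hypothetical, chosen only so that the 10% and 3% reductions combine to roughly the 8% reported; they are not RBC data.

```python
def weighted_reduction(reductions_pct, weights):
    """Capacity-weighted average of per-LPAR utilization reductions (in %)."""
    if len(reductions_pct) != len(weights) or not weights:
        raise ValueError("need one positive weight per LPAR")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r * w for r, w in zip(reductions_pct, weights)) / total_weight


# Reductions from the slide: one large LPAR (10%), two smaller LPARs (3% each).
reductions = [10.0, 3.0, 3.0]

# HYPOTHETICAL relative zIIP capacity weights; not taken from the presentation.
weights = [5, 1, 1]

print(f"Weighted reduction: {weighted_reduction(reductions, weights):.1f}%")
```

With real data, the weights would be each LPAR's share of zIIP capacity over the measured interval.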

What Does the Future Hold?
• Other platforms have had SMTx for years
• IBM currently only supports SMT2 on a zIIP
• IBM future support?

Considerations
• Vendors need to catch up with SMT2
  • IBM (RMF PTF)
  • MXG
• May have to make reporting changes internally

Recommendations / Summary
• SMT2 should be explored in order to exploit capacity and throughput improvements on a z13
• Enable SMT2 in a formal and controlled manner
• Compare before/after metrics carefully
• Workload drives results/benefits
• Your mileage will vary

References
• There are various CMG and SHARE papers available on the Internet
• IBM marketing/technical material
• EPV white papers
• Google "SMT2 z13"

Q&A and Discussion
• Are you on z13 boxes?
• Has your company implemented SMT2?
• If not, why not?
• If so, what did you see?