Measuring and Modeling Hyper-threaded Processor Performance
Ethan Bolker, UMass-Boston
September 17, 2003

Joint work with Yiping Ding and Arjun Kumar (BMC Software)
Accepted for presentation at CMG 2003, December 2003
Paper (with references) available on request

Improving Processor Performance
Speed up clock
Invent revolutionary new architecture
Replicate processors (parallel application)
Remove bottlenecks (use idle ALU):
– caches
– pipelining
– prefetch

Hyper-threading Technology (HTT)
Default for new Intel high-end chips
One ALU
Duplicate state of computation (registers) to create two logical processors (chip size *= 1.05)
Parallel instruction preparation (decode)
ALU should see ready work more often (provided there are two active threads)

The path to instruction execution
(Figure from Intel Technology Journal, Volume 06, Issue 01, February 14, 2002, p. 8.)

How little must we understand?
Batch workload: repeated dispatch of identical compute-intensive jobs
– vary number of threads
– measure throughput (jobs/second)
Treat processor as a black box
Experiment to observe behavior
Model to predict behavior
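
The measurements themselves came from the machine under test; as a rough illustration of the experiment's shape, here is a small Python sketch that runs N identical compute-intensive jobs in parallel and reports jobs/second. It uses processes rather than threads (to sidestep Python's interpreter lock), and the job body, worker counts, and names are illustrative placeholders rather than the authors' driver.

```python
# Hypothetical sketch: measure batch throughput by running identical
# compute-intensive jobs across N parallel workers and counting
# completions per second. Worker count plays the role of "threads".
import time
from multiprocessing import Pool

def job(_):
    # fixed compute-intensive unit of work (placeholder)
    x = 0
    for i in range(2_000_000):
        x += i * i
    return x

def batch_throughput(n_workers, jobs_per_worker=10):
    start = time.time()
    with Pool(n_workers) as pool:
        pool.map(job, range(n_workers * jobs_per_worker))
    elapsed = time.time() - start
    return n_workers * jobs_per_worker / elapsed  # jobs per second

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "workers:", round(batch_throughput(n), 2), "jobs/sec")
```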

Batch throughput
(Chart: measured throughput vs. number of threads; parts of the curve make sense, one region is puzzling.)

Transaction processing
More interesting than batch
Random-size jobs arrive at random times
M/M/1 (M = “Markov”)
M/*/* : arrival stream is Poisson, rate λ
*/M/* : job size exponentially distributed, mean s
*/*/1 : single processor

M/M/1 model evaluation
Utilization: U = λs
U is dimensionless: jobs/sec * sec/job
U < 1, else saturation
Response time: r = s/(1-U)
Randomness ⇒ each job sees a (virtual) processor slowed down (by other jobs) by a factor 1/(1-U), so to accumulate s seconds of real work takes r = s/(1-U) seconds of real time
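
These two formulas are simple enough to evaluate directly. A minimal Python sketch (function name and example values are mine, not from the paper):

```python
# M/M/1 response time following the slide's formulas:
# U = lambda * s (utilization) and r = s / (1 - U).
def mm1_response_time(arrival_rate, s):
    U = arrival_rate * s          # dimensionless: jobs/sec * sec/job
    if U >= 1:
        raise ValueError("saturated: utilization must be < 1")
    return s / (1 - U)            # mean response time in seconds

# Example: s = 1 second, arrival rate 0.5 jobs/sec -> U = 0.5, r = 2 seconds.
print(mm1_response_time(0.5, 1.0))
```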

Benchmark
Java driver:
– chooses interarrival times and service times from exponential distributions
– dispatches each job in its own thread
– records actual job CPU usage, response time
Input parameters:
– job arrival rate λ
– mean job service time s
Fix s = 1 second, vary λ (hence U), track r
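
The actual driver is the Java program described above; as a stand-in, here is a minimal Python discrete-event simulation of the same open-arrival experiment (Poisson arrivals, exponential service, one FIFO server). The job count, seed, and names are illustrative choices, not the authors'.

```python
# Simulate an M/M/1 queue: Poisson arrivals at rate lam, exponential
# service with mean s, single FIFO server; report mean response time.
import random

def simulate_mm1(lam, s, n_jobs=200_000, seed=1):
    random.seed(seed)
    clock = 0.0                 # arrival clock
    server_free_at = 0.0        # time the single server next becomes idle
    total_response = 0.0
    for _ in range(n_jobs):
        clock += random.expovariate(lam)          # next arrival time
        service = random.expovariate(1.0 / s)     # job size, mean s
        start = max(clock, server_free_at)        # FIFO, single server
        server_free_at = start + service
        total_response += server_free_at - clock  # wait + service
    return total_response / n_jobs

# With s = 1 and lam = 0.5 this should come out near s/(1-U) = 2.
print(simulate_mm1(0.5, 1.0))
```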

Benchmark validation
(Chart: measured response times ("practice") against the M/M/1 prediction ("theory"), r = 1/(1-U) since s = 1.)

Theory vs practice
“In theory, there is no difference between theory and practice. In practice, there is no relationship between theory and practice.” (Grant Gainey)
“The gap between theory and practice in practice is much larger than the gap between theory and practice in theory.” (Jeff Case)

Explain/remove the discrepancy
Examine and tune the benchmark driver
Compute actual coefficients of variation, incorporate them in a corrected M/M/1 formula (one candidate correction is sketched below)
Nothing helps
Postpone worry – in the meanwhile …
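
For the "corrected M/M/1 formula" item above, one standard candidate is the M/G/1 (Pollaczek-Khinchine) mean response time, which folds in the measured coefficient of variation c of the service times and reduces to s/(1-U) when c = 1. The sketch below is offered as an illustration of the kind of adjustment meant, not as the authors' exact formula.

```python
# M/G/1 (Pollaczek-Khinchine) mean response time:
# r = s + U * s * (1 + c^2) / (2 * (1 - U)), where c is the coefficient
# of variation of the service time. With c = 1 this is the M/M/1 result.
def mg1_response_time(arrival_rate, s, c):
    U = arrival_rate * s
    if U >= 1:
        raise ValueError("saturated")
    wait = U * s * (1 + c * c) / (2 * (1 - U))   # mean queueing delay
    return s + wait

print(mg1_response_time(0.5, 1.0, 1.0))  # c = 1 recovers the M/M/1 value 2.0
```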

HTT on vs HTT off
Use this benchmark to measure the effect of hyper-threading on response time
Use throughput (λ) as the independent variable
“Utilization” is ambiguous (digression)

HTT on vs HTT off
(Chart: measured response time vs. throughput with hyper-threading enabled and disabled.)

What’s happening
Hyper-threading allows more of the application parallelism to make its way to the ALU
Can we understand this quantitatively?

Model HTT architecture
Preparatory phase: service time s1; each of the two logical processors sees arrival rate λ/2
Execution phase: service time s2; the single execution unit sees the full arrival rate λ
r = s1/(1 - (λ/2)s1) + s2/(1 - λs2)
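
In code, the model is just the two M/M/1-style stages added together. A minimal sketch (function name and example rate are mine):

```python
# HTT-on model: arrivals split evenly across the two logical processors
# for the preparatory phase (service time s1), then merge at the single
# execution unit (service time s2).
def htt_on_response_time(lam, s1, s2):
    prep = s1 / (1 - (lam / 2) * s1)   # each preparatory queue sees lam/2
    execute = s2 / (1 - lam * s2)      # the shared execution unit sees the full lam
    return prep + execute

# Example with the fitted parameters quoted on the next slide (s1 = 0.13, s2 = 0.81):
print(htt_on_response_time(0.5, 0.13, 0.81))
```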

Theory vs practice: s1 = 0.13, s2 = 0.81

Model parameters
To compute response time r from the model, we need the (virtual) service parameters s1, s2 (λ is known)
Finding s1, s2:
– eyeball measured data
– fit two data points (see the sketch below)
– maximum likelihood
– derive from first principles
s1 = 0.13, s2 = 0.81 make sense: 15% of the work is preparatory, 85% execution
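
As a concrete illustration of the "fit two data points" option, the sketch below brute-force searches for the (s1, s2) pair whose model predictions best match two measured (λ, r) points. The two measurement pairs in the example are placeholders generated from the fitted parameters above, not data from the paper.

```python
# Fit (s1, s2) by grid search: minimize squared error between the model
# r = s1/(1 - (lam/2)*s1) + s2/(1 - lam*s2) and measured (lam, r) points.
def fit_two_points(points, step=0.005):
    best, best_err = None, float("inf")
    n = int(1 / step)
    for i in range(1, n):
        s1 = i * step
        for j in range(1, n):
            s2 = j * step
            err, feasible = 0.0, True
            for lam, r in points:
                if (lam / 2) * s1 >= 1 or lam * s2 >= 1:
                    feasible = False
                    break
                pred = s1 / (1 - (lam / 2) * s1) + s2 / (1 - lam * s2)
                err += (pred - r) ** 2
            if feasible and err < best_err:
                best, best_err = (s1, s2), err
    return best

# Placeholder measurements consistent with s1 = 0.13, s2 = 0.81:
print(fit_two_points([(0.4, 1.332), (0.8, 2.438)]))
```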

Benchmark validation (reprise)
Chip hardware unchanged when HTT off
Assume one path used: tandem queue
Parameter estimation as before
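
A minimal sketch of this HTT-off reading, under the assumption (not spelled out on the slide) that with a single path in use both stages of the tandem queue see the full arrival rate λ:

```python
# Presumed HTT-off counterpart of the model: both stages of the tandem
# queue see the full arrival rate lam (assumption, for illustration).
def htt_off_response_time(lam, s1, s2):
    return s1 / (1 - lam * s1) + s2 / (1 - lam * s2)

# Example at lam = 0.5 with illustrative parameters:
print(htt_off_response_time(0.5, 0.13, 0.81))
```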

Theory vs practice: s1 = s2 = 0.878

Future work
Do serious statistics
Does the 1+1 tandem queue model predict hyper-threading response as well as the more complex 2+1 model?
Understand the two-processor machine puzzle
Explore how s1 and s2 vary with application (e.g. fixed vs floating point)
Find ways to estimate s1 and s2 from first principles

Summary
Hyper-threading is …
Abstraction (modelling) leverages information: you can often understand a lot even when you know very little
r = s/(1-U) is worth remembering
You do need to connect theory and practice – and practice is harder than theory
Questions?