Using LINPACK + STREAM to Build a Simple, Composite Metric for System Throughput John D. McCalpin, Ph.D. IBM Corporation Austin, TX Presented 2002-11-20



The STREAM benchmark
Not as old as LINPACK, but older than the TOP500 list
– First published in 1991, revised as new results were delivered
– Currently contains 750 results in the table
– Covers a very wide variety of systems
STREAM is also an accidental benchmark
– Developed to show why available systems had such wide variation in performance on my ocean models

STREAM (continued)
STREAM is very simple
– Measure the sustained memory bandwidth for a set of four computational kernels:
COPY: A(I) = B(I)
SCALE: A(I) = Q*B(I)
ADD: A(I) = B(I) + C(I)
TRIAD: A(I) = B(I) + Q*C(I)
– Make sure that each of the vectors is chosen to be larger than the available cache(s)
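The four kernels can be sketched as plain loops. This is an illustrative Python transcription, not the real benchmark (which is C/Fortran and follows strict run rules); the array size, scalar value, and byte counts per iteration are assumptions for the sketch.

```python
import time

N = 1_000_000   # choose N so each array is larger than the caches
Q = 3.0         # the scalar used by SCALE and TRIAD
a = [1.0] * N
b = [2.0] * N
c = [0.0] * N

def copy():     # COPY:  A(I) = B(I)          -> 16 bytes moved per iteration
    for i in range(N): a[i] = b[i]

def scale():    # SCALE: A(I) = Q*B(I)        -> 16 bytes per iteration
    for i in range(N): a[i] = Q * b[i]

def add():      # ADD:   A(I) = B(I) + C(I)   -> 24 bytes per iteration
    for i in range(N): a[i] = b[i] + c[i]

def triad():    # TRIAD: A(I) = B(I) + Q*C(I) -> 24 bytes per iteration
    for i in range(N): a[i] = b[i] + Q * c[i]

def bandwidth(kernel, bytes_per_iter):
    """Time one pass of a kernel and report bytes/second moved."""
    t0 = time.perf_counter()
    kernel()
    t1 = time.perf_counter()
    return N * bytes_per_iter / (t1 - t0)
```

The reported number is a *sustained* rate: it includes all loop and memory-system overheads, which is exactly the point of the benchmark.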

STREAM (continued)
STREAM is not intended to predict performance
STREAM acts as a counterpoint to LINPACK
– LINPACK does not require significant memory bandwidth
– STREAM does not require significant computational capability
Can LINPACK and STREAM be used together to provide much more information to the public than either metric in isolation?

Does Peak GFLOPS predict SPECfp_rate2000?

Does Sustained Memory Bandwidth predict SPECfp_rate2000?

A Simple Composite Model
Assume the time to solution is composed of a compute time inversely proportional to peak GFLOPS plus a memory transfer time inversely proportional to sustained memory bandwidth
Assume "1 Byte/FLOP" to get:
Effective GFLOPS = 1 / (1/PeakGFLOPS + 1/MemoryBW)
Use performance of 171.swim from SPECfp_rate2000 as a proxy for memory bandwidth:
Sustained BW = (478.3 GB * (# of copies)) / (run time for 171.swim)
Testing with STREAM bandwidth (where available) gives similar results
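The model above is a direct harmonic combination and can be written in a few lines. A minimal sketch, assuming 1 Byte/FLOP so that bandwidth in GB/s can be treated as GFLOPS of "memory work":

```python
def effective_gflops(peak_gflops, mem_bw_gbs, bytes_per_flop=1.0):
    """Harmonic combination of compute time and memory-transfer time."""
    mem_gflops = mem_bw_gbs / bytes_per_flop      # GB/s -> effective GFLOPS
    return 1.0 / (1.0 / peak_gflops + 1.0 / mem_gflops)

def sustained_bw_gbs(swim_runtime_s, ncopies):
    """The slide's 171.swim proxy: 478.3 GB of traffic per copy."""
    return 478.3 * ncopies / swim_runtime_s
```

Note the metric is always below both inputs: a machine with 10 peak GFLOPS and 10 GB/s sustains only 5 effective GFLOPS under this model, and a machine that is wildly unbalanced is limited by its weaker side.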

How does the "1 Byte/FLOP" model do?

“1 Byte/FLOP” model error vs cache size

Make “Bytes/FLOP” a simple function of cache size Minimize RMS error to calculate the four parameters: –Bytes/FLOP for large caches –Bytes/FLOP for small caches –Size of asymptotically large cache –Coefficient of best-fit to SPECfp_rate2000/cpu Results (rounded to nearby round values): –Bytes/FLOP for large caches === (*) –Bytes/FLOP for small caches === 1.00 –Size of asymptotically large cache === 6 MB –Coefficient of best fit === 6.7 (+/- 0.1) (*) –The units of the coefficient are SPECfp_rate2000 / Effective GFLOPS (*) Revised to

Here is what I assumed:

Does this Revised Metric predict SPECfp_rate2000?

Statistical Metrics

Comments
Obviously, these coefficients were derived to match the SPECfp_rate2000 data set, not a "typical" set of supercomputing applications
However, the results are encouraging, delivering a projection with 15% accuracy (one sigma) using a model based on only one measurement (sustainable memory bandwidth), plus specification of several architectural features
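The quality-of-fit numbers quoted in these slides (R-squared, "Normalized Std Error") are standard quantities; a sketch of how they would be computed over the (actual, predicted) pairs, with the exact formulas assumed:

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def normalized_std_error(actual, predicted):
    """RMS relative error; ~0.15 matches the '15% (one sigma)' quoted above."""
    return math.sqrt(
        sum(((a - p) / a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )
```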

Final Summary
The proposed metric is
– Simple to understand
– Simple to measure
– Based on a correct mathematical model of performance
– Based on a potentially oversimplified model of the memory hierarchy
Adding STREAM benchmark data to the TOP500 database would allow such metrics to be constructed
Obviously, much work remains on consistent modelling of the influence of non-local data transfer in the performance of supercomputers
– The good news is that LogP models have been very successful in the past, using methodologies similar to what I have described here

Backup Slides
Important stuff that I don't have time to talk about in an 8-minute time slot!

Methodology
I extracted all the SPECfp_rate2000 (peak) results from the SPEC database on about
– Include cpu type, cpu frequency, cache sizes
– Include 171.swim performance (time in seconds)
– The data set included 331 results
I added "peak FP ops/Hz" based on other published sources
– Revisions on corrected this value for Fujitsu systems
I defined "cache" as the largest on-chip cache or off-chip SRAM cache
– Off-chip DRAM caches are much less effective at improving performance, and degrade the accuracy of the proposed metric

Does the Optimized Metric have Systematic Errors vs Cache Size?

Does the Optimized Metric have Systematic Errors vs SMP Size?

More Comments
There is some evidence of a systematic error related to load/store capability vs FP capability
– Systems with one load or store per peak FLOP (e.g., EV6, Pentium III) do better on SPECfp_rate2000 than the optimized metric suggests
– Systems capable of 4 FP ops per clock do less well on SPECfp_rate2000 than the metric suggests

More Comments (continued)
Unfortunately, recasting in terms of peak LoadStore rate does not improve the statistics
Arbitrary harmonic combinations of Peak GFLOPS and Peak LoadStore provide only a little improvement in the statistics, though they do eliminate a few of the outliers
– R-squared increases from 0.85 to 0.87
– Normalized Std Error decreases from 16% to 15%