Optimum System Balance for Systems of Finite Price
John D. McCalpin, Ph.D.
IBM Corporation, Austin, TX
SuperComputing 2004, Pittsburgh, PA, November 10, 2004

Overview
The HPC Challenge Benchmark was announced last year at SuperComputing'2003. The HPC Challenge Benchmark consists of:
– LINPACK (HPL)
– STREAM
– PTRANS (transposing the array used by HPL)
– RandomAccess (random read/modify/write)
– FFT
– and some low-level MPI latency & bandwidth measurements
No single figure of merit is defined.

Overview (continued)
Q: What is a "balanced" system?
My answer: "A balanced system is one for which the primary applications are limited in performance by the most expensive component(s) of the system."

The Two Questions
We need to understand both performance and cost in the context of low-level component metrics.
Performance
– What performance model?
– Use a harmonically weighted, time-based model
Cost
– What cost model?
– Use a simple linear additive cost model

Performance Model
Composite figures of merit for performance must be based on "time" rather than "rate"
– i.e., weighted harmonic means of rates
Why?
– Combining "rates" in any other way fails to have a "Law of Diminishing Returns"
Time = Work / Rate; repeat for each component: T_i = W_i / R_i
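As a small sketch of this rule (the function name is mine, not from the slides): a composite rate built from per-component times is a work-weighted harmonic mean of the component rates, and it automatically exhibits diminishing returns.

```python
def composite_rate(work, rates):
    """Composite rate = total work / total time, i.e. a work-weighted
    harmonic mean of the per-component rates R_i (using T_i = W_i / R_i)."""
    total_time = sum(w / r for w, r in zip(work, rates))
    return sum(work) / total_time

# Diminishing returns: making one component 100x faster cannot even
# double the composite rate, because the other component's time remains.
print(composite_rate([1.0, 1.0], [1.0, 1.0]))    # 1.0
print(composite_rate([1.0, 1.0], [100.0, 1.0]))  # ~1.98, not 2.0
```

An arithmetic mean of the two rates in the second call would report ~50.5, wildly overstating the achievable throughput; the harmonic combination does not.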

A Simple Composite Model
Assume the time to solution is composed of a compute time proportional to peak GFLOPS plus a memory transfer time proportional to sustained memory bandwidth.
Assume "x Bytes/FLOP" of memory traffic to get: T = W_flops / R_peak + (x · W_flops) / R_bw
Target SPECfp_rate2000 as the workload.

Does Peak GFLOPS predict SPECfp_rate2000?

Does Sustained Memory Bandwidth predict SPECfp_rate2000?

Optimized Model Results
Results rounded to nearby round values:
– Bytes/FLOP for large caches = 0.16
– Bytes/FLOP for small caches = 0.80
– Size of asymptotically large cache = ~12 MB
– Coefficient of best fit = ~6.4
The units of the coefficient are SPECfp_rate2000 per effective GFLOPS.
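A hedged sketch of the fitted model follows. The slides give only the two Bytes/FLOP endpoints, the ~12 MB cache asymptote, and the ~6.4 coefficient; the linear interpolation between the small-cache and large-cache values is my assumption, not the fit actually used.

```python
def effective_gflops(peak_gflops, bw_gbs, cache_mb):
    # Bytes/FLOP demand falls from 0.80 (small caches) toward 0.16
    # (asymptotically large, ~12 MB). Linear interpolation is assumed.
    frac = min(cache_mb / 12.0, 1.0)
    bytes_per_flop = 0.80 + (0.16 - 0.80) * frac
    # Time per FLOP = compute time + memory transfer time
    time_per_flop = 1.0 / peak_gflops + bytes_per_flop / bw_gbs
    return 1.0 / time_per_flop

def predicted_specfp_rate2000(peak_gflops, bw_gbs, cache_mb):
    # ~6.4 SPECfp_rate2000 per effective GFLOPS (coefficient of best fit)
    return 6.4 * effective_gflops(peak_gflops, bw_gbs, cache_mb)
```

With effectively unlimited bandwidth the effective GFLOPS approaches the peak, and the prediction reduces to 6.4 × peak, as the model intends.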

Does this Revised Metric predict SPECfp_rate2000?

Cost Model
Assume a simple linear additive model:
– FLOPS cost some amount
– Sustained BW costs a different amount
Define:
– α = R_mem / R_cpu (system characteristics)
– β = W_mem / W_cpu (application characteristics)
– γ = ($/BW) / ($/FLOPS) (technology characteristics)
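In code, the two models look like this (a sketch using the slide's symbols; the absolute $/FLOPS scale is normalized away, since only the ratio γ matters for the optimization):

```python
GAMMA = 3.0  # gamma = ($/BW) / ($/FLOPS): placeholder technology ratio

def cost(r_cpu, r_mem, gamma=GAMMA):
    # Linear additive cost: pay for peak FLOPS plus gamma-weighted
    # sustained bandwidth (in units of $/FLOPS)
    return r_cpu + gamma * r_mem

def run_time(w_cpu, w_mem, r_cpu, r_mem):
    # Time-based performance model: compute time plus memory time
    return w_cpu / r_cpu + w_mem / r_mem
```

The quantity to be optimized on the following slides is the product cost(·) × run_time(·), i.e. price/performance.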

How to Optimize?
For a given application balance β (= W_mem / W_cpu), what is the optimum system balance α?
Many people seem to believe that the system should be "balanced" to the application:
– α_optimal = β, i.e.,
– R_mem / R_cpu = W_mem / W_cpu
This does not optimize price/performance.

The Correct Optimization
This is actually an easy optimization problem.
Minimize cost/performance
– Same as minimizing cost × time
Optimum cost/performance occurs at
– α = sqrt(β / γ)
Definitely not intuitive!
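A quick numerical check of the closed form (my sketch, not from the slides): with R_mem = α·R_cpu, the α-dependent part of cost × time is (1 + γα)(1 + β/α), and a brute-force scan over α confirms the minimum at sqrt(β/γ).

```python
import math

def cost_times_time(alpha, beta, gamma):
    # (1 + gamma*alpha) * (1 + beta/alpha): the alpha-dependent factor
    # of cost * time under the linear cost and time models
    return (1.0 + gamma * alpha) * (1.0 + beta / alpha)

def best_alpha(beta, gamma):
    grid = [i / 1000.0 for i in range(1, 10001)]  # scan alpha in (0, 10]
    return min(grid, key=lambda a: cost_times_time(a, beta, gamma))

print(best_alpha(3.0, 3.0), math.sqrt(3.0 / 3.0))  # ~1.0 vs 1.0
```

Setting the derivative of (1 + γα)(1 + β/α) to zero gives γ − β/α² = 0, i.e. α = sqrt(β/γ), matching the scan.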

Example: High BW, expensive BW
β = 3 → relatively high-BW application
γ = 3 → relatively expensive BW
Optimum price/performance is at α = 1, not α = 3
Improvement in price/performance is ~30%

High BW, very expensive memory
β = 3 → relatively high-BW application
γ = 10 → very expensive BW
Optimum price/performance is at α = 0.58, not α = 3
Improvement in price/performance is ~50%

Low BW, expensive BW
β = 0.1 → low-BW application
γ = 3 → moderately expensive BW
Optimum price/performance is at α = 0.18, not α = 0.1
Improvement in price/performance is ~5%
More BW helps here even though it is expensive, because the application does not need much.

Medium BW, expensive BW
β = 1 → modest-BW application
γ = 3 → moderately expensive BW
Optimum price/performance is at α = 0.58, not α = 1
Improvement in price/performance is ~10%
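The optima in the four examples can be reproduced from the closed form α* = sqrt(β/γ). (One caveat: for β = 3, γ = 10 the formula gives sqrt(0.3) ≈ 0.55, slightly below the 0.58 quoted on that slide; the other three cases match the slides exactly.)

```python
import math

# beta = W_mem/W_cpu (application), gamma = ($/BW)/($/FLOPS) (technology)
examples = [
    ("High BW, expensive BW",        3.0,  3.0),
    ("High BW, very expensive BW",   3.0, 10.0),
    ("Low BW, expensive BW",         0.1,  3.0),
    ("Medium BW, expensive BW",      1.0,  3.0),
]
for name, beta, gamma in examples:
    print(f"{name}: alpha* = {math.sqrt(beta / gamma):.2f}")
```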

Summary
Balance is important to cost/performance.
You must understand performance.
You must understand cost.
Optimum cost/performance is not intuitive!