Automatic Measurement of Instruction Cache Capacity in X-Ray

Automatic Measurement of Instruction Cache Capacity in X-Ray Kamen Yotov kyotov@us.ibm.com IBM T. J. Watson Research Center Joint work with: Tyler Steele, Sandra Jackson, Keshav Pingali, Paul Stodghill Department of Computer Science Cornell University 11/23/2018 QEST'05

Motivation: self-optimizing software
- Goal: portable performance
- Self-optimizing software
  - Generates code with parameters whose optimal values depend on the platform (hardware / OS / compiler)
  - Determines optimal parameter values experimentally
  - Uses the native C compiler to produce the library
- Examples: ATLAS, FFTW, SPIRAL, …

Example: Register Blocking for MMM
- Hardware parameters
  - Number of FP registers (NR)
  - I-Cache Capacity (ICC)
- A simple model for the register tile size for MMM (Yotov et al., IEEE'05)
  - MU x NU + MU + NU + Temp ≤ NR
  - KU (unroll of the K loop) does not depend on NR; it depends on ICC
- Need to know NR and ICC!
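The register-tile inequality above can be explored with a small search. The following is a minimal sketch, not the model's actual selection procedure: the function name `choose_register_tile` and the objective (prefer the feasible tile with the largest MU x NU, first found in scan order on ties) are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Sketch: pick (MU, NU) satisfying MU*NU + MU + NU + temp <= nr.
 * Assumption (not from the slides): among feasible tiles, prefer the one
 * with the largest MU*NU; on ties keep the first one found (smallest MU).
 * Returns 1 and stores the tile on success, 0 if no tile fits.
 */
int choose_register_tile(int nr, int temp, int *mu_out, int *nu_out)
{
    int mu, nu, best_mu = 0, best_nu = 0;

    assert(mu_out != 0 && nu_out != 0);
    for (mu = 1; mu <= nr; mu++) {
        for (nu = 1; nu <= nr; nu++) {
            int used = mu * nu + mu + nu + temp;
            if (used > nr)
                break;          /* a larger nu only needs more registers */
            if (mu * nu > best_mu * best_nu) {
                best_mu = mu;
                best_nu = nu;
            }
        }
    }
    if (best_mu == 0)
        return 0;               /* not even a 1 x 1 tile fits */
    *mu_out = best_mu;
    *nu_out = best_nu;
    return 1;
}
```

With NR = 16 and Temp = 1, the constraint reduces to (MU + 1)(NU + 1) ≤ 16, so the sketch selects a 3 x 3 tile.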

Why not consult the manuals?
- Self-optimizing systems would require online manuals
- Actual hardware values vs. numbers available for optimization
  - For software optimization, hardware values may not be relevant
  - E.g., the number of hardware registers may not equal the number of registers available for holding program values (register 0 on SPARC)
- Incomplete
  - Parameters like capacity and line size of off-chip caches vary from model to model
  - Even the same model of computer may be shipped with different cache organizations
  - Not usually documented in processor manuals
- Moving target
Presenter note: get rid of reg0 on SPARC and talk about compiler-specialized registers and volatility with changing compiler options

Automatic Measurement Tools
- lmbench: OS benchmark, some CPU / memory benchmarks
  - Larry McVoy, BitMover, Inc.
  - Carl Staelin, HP
- Calibrator: memory hierarchy benchmark
  - Stefan Manegold, Centrum voor Wiskunde en Informatica
- MOB
  - Josep Blanquer, Robert Chalmers, University of California Santa Barbara

X-Ray
- Set of micro-benchmarks in ANSI C89
  - Download and compile on any architecture (portable)
  - Deduce hardware parameter values from timing results
- Some amount of O/S specific code
  - High-resolution timing routines
  - Super-page allocation
  - Currently support Linux Windows and Solaris, IRIX, and AIX in the works
- Paradox
  - Compiler optimizations may contaminate timing results
  - Cannot afford to turn off all optimizations
Autoconf OS

Example: Latency of Integer ADD (Step by Step)
    t = gettime();
    r1 += r2;
    return gettime() - t;
Problem: hard to measure small time intervals accurately

Step by Step (cont.)
    t = gettime();
    while (--R)    // R is number of repetitions
        r1 += r2;
    return gettime() - t;
Problem: loop overhead

Step by Step (cont.)
    t = gettime();
    i = R / U;
    while (--i)    // loop unrolled U times
    {
        r1 += r2;
        ........
    }
    return gettime() - t;
Problem: compiler optimizations

Step by Step (cont.)
    t = gettime();
    i = R / U;
    switch (v)
    {
        case 0: loop:
        case 1: r1 += r2;
        case 2: r1 += r2;
        .................
        case U: r1 += r2;
                if (--i) goto loop;
    }
    if (!v) return gettime() - t;
    else use(r1,r2);
Solution: "volatile int v = 0"

Latency of integer ADD: nano-benchmark C code
- Want to measure r1 += r2
- Generate C code from the specification <r1+=r2, <r1, r2: int>>
    volatile int v = 0;
    volatile int vr = 0;
    register int r1 = vr;
    register int r2 = vr;
    t = gettime();
    i = R / U;
    switch (v)
    {
        case 0: loop:
        case 1: r1 += r2;
        case 2: r1 += r2;
        .................
        case U: r1 += r2;
                if (--i) goto loop;
    }
    if (!v) return gettime() - t;
    else { vr = r1; vr = r2; }
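The nano-benchmark above can be written out as a compilable function. This is a minimal sketch under stated assumptions, not the code X-Ray generates: the function name `add_latency_seconds` is hypothetical, the standard `clock()` stands in for the slides' high-resolution `gettime()`, and the unroll factor U is fixed at 8 for illustration.

```c
#include <assert.h>
#include <time.h>

static volatile int v = 0;   /* always 0 at run time, but opaque to the compiler */
static volatile int vr = 0;  /* source and sink that keeps r1 and r2 live */

/*
 * Sketch: estimate the time of one `r1 += r2`, in seconds.
 * Assumptions: clock() replaces gettime(); U = 8 is an arbitrary choice.
 */
double add_latency_seconds(long reps)
{
    register int r1 = vr;
    register int r2 = vr;
    long iters = reps / 8;
    long i;
    clock_t t0, t1;

    assert(reps >= 8);
    i = iters;
    t0 = clock();
    switch (v) {     /* the case labels give the unrolled body several entry points */
    case 0:
loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
    case 5: r1 += r2;
    case 6: r1 += r2;
    case 7: r1 += r2;
    case 8: r1 += r2;
        if (--i) goto loop;
    }
    t1 = clock();
    if (v) {         /* never true, but the compiler must assume it can be */
        vr = r1;
        vr = r2;
    }
    return (double)(t1 - t0) / CLOCKS_PER_SEC / ((double)iters * 8.0);
}
```

Because `v` is volatile, the compiler cannot prove which case is entered or that the final stores are dead, so it has to keep the chain of dependent additions. Calling `add_latency_seconds(800000000L)` and multiplying the result by the clock frequency would give an estimate in cycles; the repetition count has to be large enough to dominate the coarse resolution of `clock()`.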

X-Ray architecture

Instruction Throughput
- Specification
- Control Engine
- ILP instead of general parallelism in the CPU
- N=3, B=1:

Micro-benchmarks in X-Ray
- CPU
  - Frequency
  - Instruction Latency
  - Instruction Throughput
  - Instruction Existence
    - FPU on embedded processors
    - FMA on general purpose processors
  - SMP and SMT
- Memory Hierarchy
  - Number of Registers of various types (int, float, SSE, …)
  - Multilevel Caches, TLB
    - Associativity
    - Block Size
    - Capacity
    - Latency
  - Instruction Cache Capacity

Previous Approaches for Memory Hierarchy Parameters
- Saavedra Benchmark (Hennessy-Patterson)
  - Accesses elements of an array a constant stride apart
  - Measures average memory access time
- Deficiencies
  - Considers all levels simultaneously
  - Works only for capacities that are powers of 2
  - Suffers from a number of implementation-level deficiencies
    - Constant stride accesses
    - Loop overhead problems
    - Overlapping memory operations
    - Prone to compiler "optimizations"
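For reference, a constant-stride measurement of the kind described above might look as follows. This is a deliberately naive sketch, not the Saavedra benchmark itself: the function name `strided_access_seconds` is hypothetical and `clock()` is used as a coarse, portable timer. It exhibits the listed deficiencies (loop overhead is included in the timing, and independent loads can overlap).

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

static volatile unsigned sink;  /* receives the sum so the loads stay live */

/*
 * Sketch: average time per access, in seconds, when reading every
 * `stride`-th byte of an n-byte array, repeated `passes` times.
 * Returns a negative value if the array cannot be allocated.
 */
double strided_access_seconds(size_t n, size_t stride, long passes)
{
    volatile unsigned char *a;
    unsigned sum = 0;
    size_t i, touched = 0;
    long p;
    clock_t t0, t1;

    assert(n > 0 && stride > 0 && passes > 0);
    a = (volatile unsigned char *)calloc(n, 1);
    if (a == NULL)
        return -1.0;

    t0 = clock();
    for (p = 0; p < passes; p++) {
        for (i = 0; i < n; i += stride) {
            sum += a[i];        /* volatile read: cannot be optimized away */
            touched++;
        }
    }
    t1 = clock();

    sink = sum;
    free((void *)a);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / (double)touched;
}
```

Sweeping `n` for a fixed `stride` and plotting the result would show the average access time rising as the array outgrows each cache level, which is why such a benchmark sees all levels at once.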

Example: Isolation of lower cache levels
- Idea for Ln measurements
  - Use sequences as for L1 measurements
  - Make L1…Ln-1 "transparent" to measurements
  - Unique in isolating the behavior of Ln so that all higher levels miss
- Approach
  - Use sequences of sequences
  - Convolution of sequences
Presenter note: isolate behavior, unique, therefore sequence of sequences idea, mention convolution, state main theorem

Measuring I-Cache Capacity
- Approach for Data Cache does not work
  - Array of pointers → code sequence with branches
  - Such branches are very predictable
  - Nearly impossible to get precise timing
- Measure time to execute a special code sequence of size N statements
- Find the biggest N for which there is no significant increase in time per statement

Nano-benchmark
- Similar to Instruction Throughput
- Parameters (1, 4)
- Grow length N
- Code size computed: (char *)&&finish - (char *)&&start
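Growing the code sequence to length N amounts to generating source text with N statements between the two labels. The following is a minimal sketch of such a generator, not X-Ray's own: the function name `emit_code_sequence` and the emitted layout are assumptions. Note that the `&&label` syntax in the emitted text is the GCC labels-as-values extension, not ANSI C.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch: write n copies of `stmt` between the labels `start` and
 * `finish` into buf, followed by the code-size expression.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int emit_code_sequence(char *buf, size_t cap, const char *stmt, int n)
{
    size_t used;
    int i, w;

    assert(buf != NULL && stmt != NULL && n > 0);
    w = snprintf(buf, cap, "start:\n");
    if (w < 0 || (size_t)w >= cap)
        return -1;
    used = (size_t)w;

    for (i = 0; i < n; i++) {
        w = snprintf(buf + used, cap - used, "    %s\n", stmt);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;
        used += (size_t)w;
    }

    w = snprintf(buf + used, cap - used,
                 "finish:\n"
                 "    size = (char *)&&finish - (char *)&&start;\n");
    if (w < 0 || (size_t)w >= cap - used)
        return -1;
    used += (size_t)w;
    return (int)used;
}
```

Measuring the size from label addresses in the compiled code, rather than assuming a fixed number of bytes per statement, keeps the result independent of how the compiler encodes each statement.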

Sensitivity Graph for Pentium M (9 more in the paper)
- Performance oscillates
  - Even after averaging out noise
- Cannot wait for jump
- Need more robust measurement

Control Engine Script
- Start with N=256
- Compute
  - Binary-search
  - Mean
  - Standard deviation
- For Binary-search
  - Detect jump when time is more than
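The search driven by the control engine can be sketched as a doubling phase followed by a binary search. This is an illustration under stated assumptions, not X-Ray's script: the jump criterion on the slide is cut off, so the sketch takes a caller-supplied `factor` over the baseline mean at N = 256 as a hypothetical stand-in. The names `find_capacity`, `measure_fn`, and the synthetic `step_model` are also hypothetical, and `n_max` is assumed small enough that doubling cannot overflow.

```c
#include <assert.h>

/* Hypothetical callback: time per statement for a sequence of n statements. */
typedef double (*measure_fn)(long n);

/* Synthetic stand-in for a real measurement: time jumps above 24000. */
double step_model(long n)
{
    return n <= 24000 ? 1.0 : 3.0;
}

/* Mean of `samples` repeated measurements, to average out noise. */
static double mean_of(measure_fn measure, long n, int samples)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < samples; i++)
        sum += measure(n);
    return sum / samples;
}

/*
 * Sketch: largest n whose mean time per statement stays within
 * `factor` times the baseline at n = 256 (assumed jump criterion).
 * If no jump is seen up to n_max, returns the largest n that was tested.
 */
long find_capacity(measure_fn measure, int samples, double factor, long n_max)
{
    long good = 256, bad;
    double limit;

    assert(measure != 0 && samples > 0 && factor > 1.0 && n_max >= 256);
    limit = factor * mean_of(measure, good, samples);

    /* Doubling phase: find the first n that shows a jump. */
    bad = good * 2;
    while (bad <= n_max && mean_of(measure, bad, samples) <= limit) {
        good = bad;
        bad *= 2;
    }
    if (bad > n_max)
        return good;

    /* Binary search between the last good n and the first bad n. */
    while (bad - good > 1) {
        long mid = good + (bad - good) / 2;
        if (mean_of(measure, mid, samples) <= limit)
            good = mid;
        else
            bad = mid;
    }
    return good;
}
```

The binary search assumes that time per statement is monotone around the jump. Real measurements can oscillate, so a production criterion would need to be more robust than a single threshold.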

Experimental Results

Pentium 4
- Does not cache ISA instructions, but uops
  - Trace cache
- Measure the number of instructions
- Smoothing in the nano-benchmark: minimum of time in

Conclusions
- X-Ray: a framework and tool
  - First to measure instruction cache capacity
  - Algorithms for precise measurements of some important hardware parameters
  - Experimental results on many modern architectures
- Other X-Ray resources
  - Memory hierarchy parameter measurement appeared at SIGMETRICS'05
  - CPU parameter measurement appeared at QEST'05
- Improving X-Ray is work in progress…

Current and Future Work
- 2-address vs. 3-address code
- Out-of-order execution
- Number of physical registers
- Number / type of functional units
- Cache
  - bandwidth
  - write mode
  - sharedness
  - replacement policy

Thank you!
- My e-mail: kamen@yotov.org, kyotov@us.ibm.com
- Cornell Group homepage: http://iss.cs.cornell.edu
- This work emerged from a joint project with David Padua's group at UIUC: http://polaris.cs.uiuc.edu/newframework.html
- Download X-Ray! http://iss.cs.cornell.edu/software/x-ray.aspx