Automatic Measurement of Instruction Cache Capacity in X-Ray

Automatic Measurement of Instruction Cache Capacity in X-Ray
Kamen Yotov, kyotov@us.ibm.com, IBM T. J. Watson Research Center
Joint work with: Tyler Steele, Sandra Jackson, Keshav Pingali, Paul Stodghill
Department of Computer Science, Cornell University
11/23/2018 QEST'05

Motivation: self-optimizing software
- Goal: portable performance
- Self-optimizing software
  - Generates code with parameters whose optimal values depend on the platform (hardware / OS / compiler)
  - Determines optimal parameter values experimentally
  - Uses the native C compiler to produce a library
- Examples: ATLAS, FFTW, SPIRAL, …

Example: Register Blocking for MMM (matrix-matrix multiplication)
- Hardware parameters
  - Number of FP registers (NR)
  - I-Cache Capacity (ICC)
- A simple model for the register tile size for MMM (Yotov et al., IEEE'05)
  - MU x NU + MU + NU + Temp ≤ NR
  - KU (unroll of the K loop) does not depend on NR; it depends on ICC
- Need to know NR and ICC!
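
The register-tile constraint above can be turned into a small search. The following is a minimal sketch, not the method used by ATLAS or X-Ray: it enumerates (MU, NU) pairs that satisfy MU x NU + MU + NU + Temp ≤ NR and keeps the pair with the largest MU x NU. The selection criterion (maximize the tile area, first pair wins on ties) and the names `reg_tile`, `regs_needed`, and `pick_reg_tile` are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical result type: the register tile dimensions. */
typedef struct {
    int mu;
    int nu;
} reg_tile;

/* Registers needed by an MU x NU tile: MU*NU + MU + NU + Temp. */
int regs_needed(int mu, int nu, int temp)
{
    return mu * nu + mu + nu + temp;
}

/* Assumed criterion: among tiles that fit in nr registers, pick the one
 * with the largest MU*NU; the first such tile found wins on ties.
 * Returns {0, 0} if no tile fits. */
reg_tile pick_reg_tile(int nr, int temp)
{
    reg_tile best = {0, 0};
    int mu, nu;

    assert(nr > 0 && temp >= 0);
    for (mu = 1; mu <= nr; mu++) {
        for (nu = 1; nu <= nr; nu++) {
            if (regs_needed(mu, nu, temp) <= nr &&
                mu * nu > best.mu * best.nu) {
                best.mu = mu;
                best.nu = nu;
            }
        }
    }
    return best;
}
```

For example, `pick_reg_tile(16, 1)` considers tiles such as 1 x 7, 2 x 4, and 3 x 3, all of which fit in 16 registers with one temporary, and keeps the 3 x 3 tile because it has the largest area.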

Why not consult the manuals?
- Self-optimizing systems
  - Require online manuals
- Actual hardware values vs. number available for optimization
  - For software optimization, hardware values may not be relevant
  - E.g., the number of hardware registers may not equal the number of registers available for holding program values (register 0 on SPARC)
- Incomplete
  - Parameters like capacity and line size of off-chip caches vary from model to model
  - Even the same model of computer may be shipped with different cache organizations
  - Not usually documented in processor manuals
- Moving target
Note: get rid of reg0 on SPARC and talk about compiler-specialized registers and volatility with changing compiler options

Automatic Measurement Tools
- lmbench
  - OS benchmark, some CPU / memory benchmarks
  - Larry McVoy, BitMover, Inc.
  - Carl Staelin, HP
- Calibrator
  - Memory hierarchy benchmark
  - Stefan Manegold, Centrum voor Wiskunde en Informatica
- MOB
  - Josep Blanquer, Robert Chalmers, University of California Santa Barbara

X-Ray
- Set of micro-benchmarks in ANSI C89
  - Download and compile on any architecture (portable)
  - Deduce hardware parameter values from timing results
- Some amount of O/S specific code
  - High-resolution timing routines
  - Super-page allocation
  - Currently support Linux Windows and Solaris, IRIX, and AIX in the works
- Paradox
  - Compiler optimizations may contaminate timing results
  - Cannot afford to turn off all optimizations
Note: Autoconf OS

Example: Latency of Integer ADD (Step by Step)
    t = gettime();
    r1 += r2;
    return gettime() - t;
Problem: hard to measure small time intervals accurately

Step by Step (cont.)
    t = gettime();
    while (R--)        /* R is the number of repetitions */
        r1 += r2;
    return gettime() - t;
Problem: loop overhead

Step by Step (cont.)
    t = gettime();
    i = R / U;
    while (i--)        /* loop unrolled U times */
    {
        r1 += r2;
        ........
    }
    return gettime() - t;
Problem: compiler optimizations

Step by Step (cont.)
Solution: “volatile int v = 0”
    t = gettime();
    i = R / U;
    switch (v) {
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    .................
    case U: r1 += r2;
            if (--i) goto loop;
    }
    if (!v) return gettime() - t;
    else use(r1,r2);

Latency of integer ADD: nano-benchmark C code
- Want to measure r1+=r2
- Generate C code from the specification <r1+=r2, <r1, r2: int>>
    volatile int v = 0;
    volatile int vr = 0;
    register int r1 = vr;
    register int r2 = vr;
    t = gettime();
    i = R / U;
    switch (v) {
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    .................
    case U: r1 += r2;
            if (--i) goto loop;
    }
    if (!v) return gettime() - t;
    else { vr = r1; vr = r2; }
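
The step-by-step construction can be assembled into one compilable function. This is a minimal sketch under stated assumptions, not the code X-Ray generates: the unroll factor is fixed at U = 4, `gettime()` is a hypothetical stand-in built on the standard `clock()` function, the function name `time_int_add` is invented here, and the operands are `unsigned` rather than `int` so that repeated additions cannot trigger signed-overflow undefined behavior.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the slides' gettime(); returns CPU seconds. */
static double gettime(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

volatile int v = 0;        /* always 0 at run time; the compiler cannot assume so */
volatile unsigned vr = 1;  /* opaque source and sink for the operands */

/* Times about `reps` integer additions with the body unrolled U = 4 times.
 * Returns elapsed seconds, or -1.0 on the (never taken) v != 0 path. */
double time_int_add(long reps)
{
    register unsigned r1 = vr;
    register unsigned r2 = vr;
    long i;
    double t;

    assert(reps >= 4);
    i = reps / 4;
    t = gettime();
    switch (v) {           /* every case label is a possible entry point */
    case 0:
    loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
        if (--i) goto loop;
    }
    if (!v)
        return gettime() - t;
    vr = r1;               /* not reached when v == 0; keeps r1 and r2 live */
    vr = r2;
    return -1.0;
}
```

Dividing the returned time by the number of additions executed gives a rough per-instruction estimate. An optimizing compiler may still combine the additions within one unrolled body, so the result should be read as illustrative only.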

X-Ray architecture

Instruction Throughput
- Specification
- Control Engine
- ILP instead of general parallelism in the CPU
- N=3, B=1:

Micro-benchmarks in X-Ray
- CPU
  - Frequency
  - Instruction Latency
  - Instruction Throughput
  - Instruction Existence
    - FPU on embedded processors
    - FMA on general purpose processors
  - SMP and SMT
- Memory Hierarchy
  - Number of Registers of various types (int, float, SSE, …)
  - Multilevel Caches, TLB
    - Associativity
    - Block Size
    - Capacity
    - Latency
  - Instruction Cache Capacity

Previous Approaches for Memory Hierarchy Parameters
- Saavedra Benchmark (Hennessy-Patterson)
  - Accesses elements of an array a constant stride apart
  - Measures average memory access time
- Deficiencies
  - Considers all levels simultaneously
  - Works only for capacities that are powers of 2
  - Suffers from a number of implementation-level deficiencies
    - Constant stride accesses
    - Loop overhead problems
    - Overlapping memory operations
    - Prone to compiler “optimizations”
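
The constant-stride idea behind the earlier benchmark can be sketched as follows. This is an illustration of the general technique only, not the Saavedra benchmark's actual code; the function names, the use of the standard `clock()` timer, and the `volatile` qualifier used to keep the loads from being optimized away are all assumptions. The sketch also inherits the deficiencies listed above: loop overhead is included in the measurement and independent loads may overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Number of elements touched in one pass over `size` elements. */
size_t accesses_per_pass(size_t size, size_t stride)
{
    assert(stride > 0);
    return (size + stride - 1) / stride;
}

/* Reads a[0], a[stride], a[2*stride], ... for `passes` passes and
 * returns the average CPU time per access in seconds. */
double strided_access_time(const volatile int *a, size_t size,
                           size_t stride, int passes)
{
    size_t i, n;
    int p;
    unsigned sink = 0;
    clock_t t0, t1;

    assert(size > 0 && stride > 0 && passes > 0);
    n = accesses_per_pass(size, stride) * (size_t)passes;
    t0 = clock();
    for (p = 0; p < passes; p++)
        for (i = 0; i < size; i += stride)
            sink += (unsigned)a[i];   /* volatile read: the load must happen */
    t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / (double)n;
}
```

Repeating the call for a range of array sizes and strides, then plotting the average time per access, gives the kind of curve from which cache parameters are read off by hand.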

Example: Isolation of lower cache levels
- Idea for Ln measurements
  - Use sequences as for L1 measurements
  - Make L1…Ln-1 “transparent” to measurements
  - Unique in isolating the behavior of Ln so that all higher levels miss
- Approach
  - Use sequences of sequences
  - Convolution of sequences
Note: isolate behavior, unique, therefore sequence of sequences idea, mention convolution, state main theorem

Measuring I-Cache Capacity
- Approach for Data Cache does not work
  - Array of pointers → code sequence with branches
  - Such branches are very predictable
  - Nearly impossible to get precise timing
- Measure time to execute a special code sequence of size N statements
- Find the biggest N for which there is no significant increase in time per statement

Nano-benchmark
- Similar to Instruction Throughput
  - Parameters (1, 4)
  - Grow length N
- Code size computed
  - (char *)&&finish - (char *)&&start (the && label-address operator is a GNU C extension)

Sensitivity Graph for Pentium M
- 9 more in the paper
- Performance oscillates
  - Even after averaging out noise
  - Cannot wait for jump
  - Need more robust measurement

Control Engine Script
- Start with N=256
- Compute
  - Mean
  - Standard deviation
- Binary-search
  - Detect jump when time is more than
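
One way such a control script could be organized is sketched below. The transcript does not preserve the jump threshold, so everything here is an assumption for illustration: a baseline mean and variance are taken from a few runs at the starting N, a measurement counts as a jump when it exceeds the mean by more than `k` standard deviations (`k` is a hypothetical parameter), N is doubled until a jump appears, and a binary search then narrows the boundary. `fake_measure` is a synthetic stand-in for a real timing run, with an invented step at 8192 statements.

```c
#include <assert.h>

#define BASE_RUNS 8

/* Hypothetical callback: time per statement for a code sequence of n statements. */
typedef double (*measure_fn)(long n);

/* Synthetic stand-in for a real measurement: flat up to 8192, then slower. */
double fake_measure(long n)
{
    return n <= 8192 ? 1.0 : 3.0;
}

/* Assumed rule: jump if t > mean + k * stddev, tested on squares to avoid sqrt. */
int is_jump(double t, double mean, double var, double k)
{
    double d = t - mean;
    return d > 0.0 && d * d > k * k * var;
}

/* Returns the largest n in [start, limit] without a jump, or -1 if no jump is seen. */
long find_capacity(measure_fn measure, long start, long limit, double k)
{
    double base[BASE_RUNS], mean = 0.0, var = 0.0;
    long lo = start, hi, mid;
    int i;

    assert(start > 0 && limit > start);
    for (i = 0; i < BASE_RUNS; i++) {          /* baseline statistics at N = start */
        base[i] = measure(start);
        mean += base[i] / BASE_RUNS;
    }
    for (i = 0; i < BASE_RUNS; i++)
        var += (base[i] - mean) * (base[i] - mean) / BASE_RUNS;

    hi = lo * 2;                               /* grow N until a jump is detected */
    while (hi <= limit && !is_jump(measure(hi), mean, var, k)) {
        lo = hi;
        hi *= 2;
    }
    if (hi > limit)
        return -1;
    while (hi - lo > 1) {                      /* binary search between lo and hi */
        mid = lo + (hi - lo) / 2;
        if (is_jump(measure(mid), mean, var, k))
            hi = mid;
        else
            lo = mid;
    }
    return lo;
}
```

With the synthetic model, `find_capacity(fake_measure, 256, 1L << 20, 3.0)` returns 8192. On real hardware the measurements oscillate, so a plain binary search like this one would need smoothing or repeated sampling before its result could be trusted.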

Experimental Results

Pentium 4
- Does not cache ISA instructions, but uops
  - Trace cache
- Measure the number of instructions
- Smoothing in the nano-benchmark: minimum of time in

Conclusions
- X-Ray: a framework and tool
  - First to measure instruction cache capacity
  - Algorithms for precise measurements of some important hardware parameters
  - Experimental results on many modern architectures
- Other X-Ray resources
  - Memory hierarchy parameter measurement appeared at SIGMETRICS’05
  - CPU parameter measurement appeared at QEST’05
- Improving X-Ray is work in progress…

Current and Future Work
- 2-address vs. 3-address code
- Out-of-order execution
- Number of physical registers
- Number / type of functional units
- Cache
  - bandwidth
  - write mode
  - sharedness
  - replacement policy

Thank you!
- My e-mail: kamen@yotov.org, kyotov@us.ibm.com
- Cornell group homepage: http://iss.cs.cornell.edu
- This work emerged from a joint project with David Padua’s group at UIUC: http://polaris.cs.uiuc.edu/newframework.html
- Download X-Ray! http://iss.cs.cornell.edu/software/x-ray.aspx