Automatic Measurement of Instruction Cache Capacity in X-Ray

Kamen Yotov, IBM T. J. Watson Research Center
Joint work with: Tyler Steele, Sandra Jackson, Keshav Pingali, Paul Stodghill (Department of Computer Science, Cornell University)
11/23/2018 QEST'05

Motivation: self-optimizing software
Goal: portable performance
Self-optimizing software
  Generates code with parameters whose optimal values depend on the platform (hardware / OS / compiler)
  Determines optimal parameter values experimentally
  Uses native C compiler to produce library
Examples: ATLAS, FFTW, SPIRAL, …

Example: Register Blocking for MMM
Hardware parameters
  Number of FP registers (NR)
  I-Cache Capacity (ICC)
A simple model for the register tile size for MMM (Yotov et al., IEEE'05)
  MU × NU + MU + NU + Temp ≤ NR
KU (unroll of K loop)
  does not depend on NR
  depends on ICC
Need to know NR and ICC!
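To make the register-tile inequality concrete, here is a minimal sketch that searches for the tile with the largest MU × NU that still satisfies MU × NU + MU + NU + Temp ≤ NR. The function name `choose_tile`, the `struct tile` type, and the choice to maximize the product are assumptions made for illustration; the slides only state the constraint, not how a generator picks a tile.

```c
#include <assert.h>

/* Hypothetical result type: one register tile (MU rows, NU columns). */
struct tile { int mu; int nu; };

/* Exhaustive search over (MU, NU) pairs that satisfy
 *     MU*NU + MU + NU + Temp <= NR
 * keeping the pair with the largest product MU*NU.
 * Maximizing the product is an assumption of this sketch.
 * Returns {0, 0} when no tile fits. */
struct tile choose_tile(int nr, int temp)
{
    struct tile best = {0, 0};
    int mu, nu;

    assert(nr > 0 && temp >= 0);
    for (mu = 1; mu <= nr; mu++) {
        for (nu = 1; nu <= nr; nu++) {
            /* The left-hand side grows with nu, so stop at the first miss. */
            if (mu * nu + mu + nu + temp > nr)
                break;
            if (mu * nu > best.mu * best.nu) {
                best.mu = mu;
                best.nu = nu;
            }
        }
    }
    return best;
}
```

With NR = 16 and Temp = 1, `choose_tile(16, 1)` yields a 3 × 3 tile, since 9 + 3 + 3 + 1 = 16. KU is not covered here because, as the slide notes, it depends on ICC rather than NR.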

Why not consult the manuals?
Self-optimizing systems
  Require online manuals
Actual hardware values vs. number available for optimization
  For software optimization, hardware values may not be relevant
  E.g., the number of hardware registers may not be equal to the number of registers available for holding program values (register 0 on SPARC)
Incomplete
  Parameters like capacity and line size of off-chip caches vary from model to model
  Even the same model of computer may be shipped with different cache organizations
  Not usually documented in processor manuals
Moving target
Note: get rid of reg0 on SPARC and talk about compiler-specialized registers and volatility with changing compiler options

Automatic Measurement Tools
lmbench
  OS benchmark, some CPU / memory benchmarks
  Larry McVoy, BitMover, Inc.; Carl Staelin, HP
Calibrator
  Memory hierarchy benchmark
  Stefan Manegold, Centrum voor Wiskunde en Informatica
MOB
  Josep Blanquer, Robert Chalmers, University of California Santa Barbara

X-Ray
Set of micro-benchmarks in ANSI C89
  Download and compile on any architecture (portable)
  Deduce hardware parameter values from timing results
Some amount of O/S specific code
  High-resolution timing routines
  Super-page allocation
  Currently support Linux, Windows and Solaris, IRIX, and AIX in the works
Paradox
  Compiler optimizations may contaminate timing results
  Cannot afford to turn off all optimizations
Autoconf OS

Example: Latency of Integer ADD (Step by Step)
    t = gettime();
    r1 += r2;
    return gettime() - t;
Problem: hard to measure small time intervals accurately

Step by Step (cont.)
    t = gettime();
    while (--R)    // R is number of repetitions
        r1 += r2;
    return gettime() - t;
Problem: loop overhead

Step by Step (cont.)
    t = gettime();
    i = R / U;
    while (--i)    // loop unrolled U times
    {
        r1 += r2;
    }
    return gettime() - t;
Problem: compiler optimizations

Step by Step (cont.)
    t = gettime();
    i = R / U;
    switch (v)
    {
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    ...
    case U: r1 += r2;
            if (--i) goto loop;
    }
    if (!v)
        return gettime() - t;
    else
        use(r1, r2);
Solution: "volatile int v = 0"

Latency of integer ADD: nano-benchmark C code
Want to measure r1 += r2
Generate C code from specification <r1+=r2, <r1, r2: int>>
    volatile int v = 0;
    volatile int vr = 0;
    register int r1 = vr;
    register int r2 = vr;
    t = gettime();
    i = R / U;
    switch (v)
    {
    case 0: loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    ...
    case U: r1 += r2;
            if (--i) goto loop;
    }
    if (!v)
        return gettime() - t;
    else
    {
        vr = r1;
        vr = r2;
    }
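The generated nano-benchmark above can be written out as a compilable function. The following is a sketch under stated assumptions, not X-Ray's actual generated code: the unroll factor is fixed at 4, the standard `clock()` function stands in for `gettime()` (which the slides do not define and which X-Ray implements with OS-specific high-resolution timers), and the name `measure_add` is hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Unroll factor; the value 4 is an assumption made for this sketch. */
#define U 4

/* Times roughly `reps` dependent integer adds and returns elapsed seconds.
 * clock() is a portable but coarse stand-in for the slides' gettime(). */
double measure_add(long reps)
{
    volatile int v = 0;   /* opaque to the compiler, always 0 at run time */
    volatile int vr = 0;  /* hides the initial values of r1 and r2 */
    int r1 = vr;
    int r2 = vr;
    long i;
    clock_t t;

    assert(reps >= U);
    i = reps / U;
    t = clock();
    switch (v) {
    case 0:
    loop:
    case 1: r1 += r2;   /* falls through */
    case 2: r1 += r2;   /* falls through */
    case 3: r1 += r2;   /* falls through */
    case 4: r1 += r2;
        if (--i) goto loop;
    }
    t = clock() - t;
    if (!v)
        return (double)t / CLOCKS_PER_SEC;
    /* Not taken at run time because v is 0, but the compiler cannot
     * know that, so r1 and r2 stay live. */
    vr = r1;
    vr = r2;
    return -1.0;
}
```

Dividing the result by `U * (reps / U)` gives an estimate of the time per add. Because `clock()` typically has much coarser resolution than a cycle counter, `reps` needs to be large for the estimate to be meaningful, and nothing here guarantees that a particular compiler leaves the adds untouched.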

X-Ray architecture

Instruction Throughput
Specification
Control Engine
ILP instead of general parallelism in the CPU
N=3, B=1:

Micro-benchmarks in X-Ray
CPU
  Frequency
  Instruction Latency
  Instruction Throughput
  Instruction Existence
    FPU on embedded processors
    FMA on general purpose processors
  SMP and SMT
Memory Hierarchy
  Number of Registers of various types (int, float, SSE, …)
  Multilevel Caches, TLB
    Associativity
    Block Size
    Capacity
    Latency
  Instruction Cache Capacity

Previous Approaches for Memory Hierarchy Parameters
Saavedra Benchmark (Hennessy-Patterson)
  Accesses elements of an array constant stride apart
  Measures average memory access time
Deficiencies
  Considers all levels simultaneously
  Works only for capacities that are powers-of-2
  Suffers from a number of implementation level deficiencies
    Constant stride accesses
    Loop overhead problems
    Overlapping memory operations
    Prone to compiler "optimizations"
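For readers unfamiliar with the constant-stride approach, the sketch below shows its general shape: walk an array with a fixed stride and report the average time per access. It is an illustration written for this transcript, not the Saavedra benchmark itself; the function name `strided_access_time`, its parameters, and the use of `clock()` are assumptions. It deliberately exhibits one of the listed deficiencies, since the loop overhead is included in the measured time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Touches buf[0], buf[stride], buf[2*stride], ... for `passes` sweeps and
 * returns the average time per access in seconds.
 * The volatile qualifier forces every read-modify-write to reach memory. */
double strided_access_time(volatile unsigned char *buf, size_t size,
                           size_t stride, long passes)
{
    size_t i;
    long p;
    long accesses = 0;
    clock_t t;

    assert(buf != NULL && size > 0 && stride > 0 && passes > 0);
    t = clock();
    for (p = 0; p < passes; p++) {
        for (i = 0; i < size; i += stride) {
            buf[i]++;       /* one read and one write per element */
            accesses++;
        }
    }
    t = clock() - t;
    return (double)t / CLOCKS_PER_SEC / (double)accesses;
}
```

Plotting the result for a range of `size` and `stride` values produces the familiar staircase, with every cache level contributing to each point, which is why the slide lists "considers all levels simultaneously" as a deficiency.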

Example: Isolation of lower cache levels
Idea for Ln measurements
  Use sequences as for L1 measurements
  Make L1…Ln-1 "transparent" to measurements
  Unique in isolating the behavior of Ln so that all higher levels miss
Approach
  Use sequences of sequences
  Convolution of sequences
Note: isolate behavior, unique, therefore sequence of sequences idea, mention convolution, state main theorem

Measuring I-Cache Capacity
Approach for Data Cache does not work
  Array of pointers → code sequence with branches
  Such branches are very predictable
  Nearly impossible to get precise timing
Measure time to execute special code sequence of size N statements
Find the biggest N for which there is no significant increase in time per statement

Nano-benchmark
Similar to Instruction Throughput
  Parameters (1, 4)
  Grow length N
Code size computed
    (char *)&&finish - (char *)&&start

Sensitivity Graph for Pentium M
9 more in the paper
Performance oscillates
  Even after averaging out noise
  Cannot wait for jump
  Need more robust measurement

Control Engine Script
Start with N=256
Compute
  Mean
  Standard deviation
  For …
Binary-search
  Detect jump when time is more than …
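The script outlined on this slide can be sketched as follows. The exact jump threshold is missing from the transcript, so this sketch assumes a jump is any time more than `k` standard deviations above the baseline mean; that rule, the doubling phase used to bracket the jump before binary search, and every name here (`find_capacity`, `timer_fn`, `fake_timer`, `FAKE_CAPACITY`) are assumptions for illustration. A synthetic timer stands in for the real nano-benchmark so the search logic can be exercised on its own.

```c
#include <assert.h>

/* Hypothetical callback: time per statement for a sequence of n statements. */
typedef double (*timer_fn)(long n);

/* Synthetic stand-in for the nano-benchmark: flat cost up to an assumed
 * capacity of 6000 statements, three times slower beyond it. */
#define FAKE_CAPACITY 6000L
double fake_timer(long n)
{
    return n <= FAKE_CAPACITY ? 1.0 : 3.0;
}

/* Assumed jump rule: t exceeds mean by more than k standard deviations.
 * Compared in squared form so no square root is needed. */
static int is_jump(double t, double mean, double var, double k)
{
    double d = t - mean;
    return d > 0.0 && d * d > k * k * var;
}

/* Returns the largest N in [n0, n_max] that shows no jump. */
long find_capacity(timer_fn time_per_stmt, long n0, int samples,
                   double k, long n_max)
{
    double sum = 0.0, sumsq = 0.0, mean, var, t;
    long lo, hi, mid;
    int s;

    assert(time_per_stmt && n0 > 0 && samples > 1 && k > 0.0 && n_max >= n0);

    /* Baseline mean and variance at the starting size. */
    for (s = 0; s < samples; s++) {
        t = time_per_stmt(n0);
        sum += t;
        sumsq += t * t;
    }
    mean = sum / samples;
    var = sumsq / samples - mean * mean;
    if (var < 0.0)
        var = 0.0;          /* guard against rounding */

    /* Double N until a jump is seen (assumed bracketing step). */
    lo = hi = n0;
    while (hi < n_max) {
        hi = (hi * 2 <= n_max) ? hi * 2 : n_max;
        if (is_jump(time_per_stmt(hi), mean, var, k))
            break;
        lo = hi;
    }
    if (lo == hi)
        return lo;          /* no jump found up to n_max */

    /* Binary search: lo never jumps, hi always does. */
    while (hi - lo > 1) {
        mid = lo + (hi - lo) / 2;
        if (is_jump(time_per_stmt(mid), mean, var, k))
            hi = mid;
        else
            lo = mid;
    }
    return lo;
}
```

Calling `find_capacity(fake_timer, 256, 5, 3.0, 1L << 20)` recovers the assumed capacity of 6000 statements, which is not a power of two. With a real, noisy timer the oscillation described on the previous slide means a single comparison per N is unlikely to be enough, so each probe would need repeated measurements.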

Experimental Results

Pentium 4
Does not cache ISA instructions, but uops
  Trace cache
  Measure the number of instructions
Smoothing in the nano-benchmark: minimum of time in …

Conclusions
X-Ray: A framework and tool
  First to measure instruction cache capacity
  Algorithms for precise measurements of some important hardware parameters
  Experimental results on many modern architectures
Other X-Ray resources
  Memory Hierarchy parameter measurement appeared at SIGMETRICS'05
  CPU parameter measurement appeared at QEST'05
  Improving X-Ray is work in progress…

Current and Future Work
2-address vs. 3-address code
Out-of-Order execution
  Number of physical registers
  Number / type of functional units
Cache
  bandwidth
  write mode
  sharedness
  replacement policy

Thank you!
My E-Mail
Cornell Group homepage
This work emerged from a joint project with David Padua's group at UIUC
Download X-Ray!
