Static Identification of Delinquent Loads. V.M. Panait, A. Sasturkar, W.-F. Fong.

Agenda: Introduction; Related Work; Delinquent Loads; Framework; Address Patterns and Decision Criteria; The Heuristic (types of classes, computing the weights, final classes); Results.

Introduction. The cache is one of the major current performance bottlenecks. One approach is to prefetch, but prefetch what? We can't prefetch everything. Few loads are really "bad" ("delinquent loads"). This paper: a classification of the address patterns of load instructions.

Introduction. The analysis is done after code generation but before runtime. It singled out 10% of all loads, causing over 90% of the misses, in 18 SPEC benchmarks. It gets even better combined with basic-block profiling: 1.3% of loads covering over 80% of the misses.

Related Work. The BDH method classifies loads based on the following criteria. Region of memory accessed by the load: (S)tack, (H)eap or (G)lobal. Kind of reference: (S)calar, element of an (A)rray or (F)ield of a structure. Type of reference: (P)ointer or (N)ot.

Related Work. Some classes account for most misses: GAN, HSN, HFN, HAN, HFP, HAP. The OKN method uses 3 simple heuristics: use of a pointer dereference, use of a strided reference, or none of the above. The method in this paper is much more precise than both of the above.

Delinquent Loads. Why not stores too? Write buffers are apparently good enough. Why not do it in hardware? Hardware schemes exist, but they need additional specialized hardware, and making complex decisions fast requires complex hardware. Memory profiling is not always practical.

Delinquent Loads & Profiling

Framework. Assembly code -> address patterns for each load instruction -> placement of the load instruction in a class. Classes + weights -> heuristic function. If the value of the heuristic is greater than a delinquency threshold, the instruction is classified as possibly delinquent.

Address Patterns. An address pattern is a summary of how the source address of the load instruction is computed. It uses the CFG and data-flow analysis (reaching definitions), with one address pattern for each control path reaching the load. Only basic registers (BRs) are used: gp, sp, the parameter registers, and the return register.
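As a rough sketch of what such a summary might contain (the Python representation and field names below are illustrative assumptions; the slide only specifies that patterns are expressed over the basic registers):

from dataclasses import dataclass, field

# Hypothetical per-control-path summary of how a load's source address is
# computed from the basic registers (BRs): gp, sp, parameter and return regs.
BASIC_REGISTERS = ("gp", "sp", "param", "ret")

@dataclass
class AddressPattern:
    br_counts: dict                         # occurrences of each BR in the pattern
    ops: set = field(default_factory=set)   # operations used (e.g. "add", "mul", "sll")
    deref_level: int = 0                    # maximum level of dereferencing
    recurrent: bool = False                 # address depends on its own previous value

# Example: a strided walk through a heap array reached via a pointer on the stack.
pattern = AddressPattern(
    br_counts={"gp": 0, "sp": 1, "param": 0, "ret": 0},
    ops={"sll", "add"},
    deref_level=1,
    recurrent=True,
)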

The Decision Criteria. Classes are derived from these criteria. H1: register usage in an address pattern (usage of BRs). H2: type of operations used in the address computation (arithmetic, logic). H3: maximum level of dereferencing.

The Decision Criteria. H4: recurrence (an iterative walk through memory). H5: execution frequency, based on basic-block profiling; it classifies loads as rarely executed (used here as a negative class), seldom executed (also negative), fairly often executed (not used here), or in a program hotspot.

Decision Criteria and Classes. Each criterion results in a set of classes; a class is a set of address patterns with a certain property. Many classes can result; only some are considered, and some of those are also aggregated into one class.

Decision Criteria and Classes. H1-based classes: enumerations of the number of occurrences of each of the 4 BRs in an address pattern. H2-based classes: address patterns with multiplications and shift operations. H3-based classes: as many classes as there are levels of dereferencing in the address patterns.
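A hedged sketch of how membership in the H1-, H2-, and H3-based classes could be read off such a pattern (the class-label strings and the dictionary layout are invented for illustration):

# Derive H1/H2/H3 class labels from an address-pattern summary.
# Pattern fields and class-label strings are illustrative assumptions.

def classes_for_pattern(pattern):
    classes = set()
    # H1: one class per enumeration of BR occurrence counts.
    for reg, count in pattern["br_counts"].items():
        classes.add(f"H1:{reg}x{count}")
    # H2: address patterns that use multiplications or shifts.
    if pattern["ops"] & {"mul", "mult", "sll", "srl"}:
        classes.add("H2:mul_or_shift")
    # H3: one class per maximum level of dereferencing.
    classes.add(f"H3:deref{pattern['deref_level']}")
    return classes

example = {"br_counts": {"gp": 0, "sp": 2, "param": 0, "ret": 0},
           "ops": {"sll", "add"},
           "deref_level": 2}
print(sorted(classes_for_pattern(example)))
# ['H1:gpx0', 'H1:paramx0', 'H1:retx0', 'H1:spx2', 'H2:mul_or_shift', 'H3:deref2']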

Decision Criteria and Classes. H4-based classes: two classes (the address pattern involves a recurrence or not). H5-based classes: three classes (rarely executed, seldom executed, and program hotspot).

Experimental Setup. SimpleScalar toolkit: cache simulator (for cache hits and misses), compiler, objdump. Procedure: Fortran -> C code (via f2c) -> MIPS executable (via the C2MIPS compiler) -> disassembled code (via objdump), followed by reconstruction of the CFG and data-flow analysis.

Experimental Setup. Two stages: learning/training and experimental (actual). Stage 1: get full memory-profiling data on a subset of the SPEC benchmarks and use it to compute weights for each class. Stage 2: use the heuristic thus obtained on a new subset of benchmarks.

The Heuristic: Types of Classes. Three types of classes: positive (loads in it are likely delinquent), negative (loads in it are likely not delinquent), and neutral. Positive classes have positive weights, negative ones have negative weights, and neutral classes have a weight of zero.

The Heuristic: Terminology. The miss probability of class F in benchmark j: m_j(F, C). The amount of misses accounted for by members of class F in benchmark j: n_j(F, C).

m_j(F, C) = the likelihood that an instruction of class F in benchmark j is a cache miss. However, if that instruction is executed only once, it won't be a delinquent load. n_j(F, C) = the proportion of the total number of misses that members of F account for.

The Heuristic: Terminology. Strength index: r_j = m_j / n_j. A benchmark j is irrelevant to a class F if both indices m_j and n_j are below certain thresholds; otherwise it is relevant. Positive class: r_j > 5% for all benchmarks. Negative class: n_j < 0.5% for all benchmarks. Neutral class: r_j < 5% for at least one benchmark.
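A small sketch of this labelling rule in code (the 5% and 0.5% thresholds are from the slide; the relevance thresholds and the choice to test the strength index over relevant benchmarks only are assumptions):

# Label a class positive / negative / neutral from its per-benchmark statistics.
# m[j]: miss probability of the class in benchmark j
# n[j]: share of all misses in benchmark j accounted for by the class

def class_type(m, n, rel_m=0.01, rel_n=0.01):
    # A benchmark is irrelevant if both indices fall below some thresholds
    # (the 1% relevance thresholds here are placeholders, not from the paper).
    relevant = [j for j in m if m[j] >= rel_m or n[j] >= rel_n]
    if all(n[j] < 0.005 for j in m):          # n_j < 0.5% for all benchmarks
        return "negative"
    strength = {j: m[j] / n[j] for j in relevant if n[j] > 0}
    if strength and all(r > 0.05 for r in strength.values()):  # r > 5% everywhere
        return "positive"
    return "neutral"                          # r < 5% for at least one benchmark

print(class_type({"gcc": 0.30, "mcf": 0.40}, {"gcc": 0.02, "mcf": 0.05}))  # positive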

Computing the Weights. Form classes according to the five decision criteria. Compute m_j and n_j for each class. Weight of class F_k: computed from these statistics, averaged over the benchmarks relevant to F_k (formula given on the slide).

Computing the Weights. This is the formula for positive classes only. Only relevant benchmarks are included in the formula; |·| is the cardinality of that set, i.e., the number of benchmarks relevant to that class.

Aggregate Classes. AG1: both gp and sp are used, at least once each (comes from H1). AG2: only sp is used, at least twice (H1). AG3: either multiplications or shifts are used (H2). AG4: one level of dereferencing (H3). AG5: two levels of dereferencing (H3). AG6: three levels of dereferencing (H3).

Aggregate Classes. AG7: address patterns containing a recurrence (H4). AG8: loads with a low frequency of execution (100 < f < 1000) (H5). AG9: loads with a fairly low frequency of execution (f < 100) (H5). Weight formula for negative classes: the negated mean of the positive weights.
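A hedged sketch of the weight computation: the per-benchmark quantity averaged for positive classes is not visible in this transcript, so the use of m_j below is an assumption; the negated-mean rule for negative classes is from the slide.

# Weight computation sketch. Positive-class weights average a per-benchmark
# statistic over only the benchmarks relevant to that class; using m_j for
# that statistic is an assumption, since the formula is not in the transcript.

def positive_weight(m, relevant_benchmarks):
    return sum(m[j] for j in relevant_benchmarks) / len(relevant_benchmarks)

def negative_weight(positive_weights):
    # Negative classes get the negated mean of the positive weights.
    return -sum(positive_weights) / len(positive_weights)

# Neutral classes simply get a weight of zero.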

The Heuristic Function. h(load) = 1 if the sum of the weights of the classes the load belongs to exceeds the delinquency threshold, and 0 otherwise; if h(load) = 1, the load is classified as delinquent.
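In code, the decision rule is a weighted class-membership sum compared against the delinquency threshold (the class labels, weights, and threshold value below are placeholders, not the paper's trained values):

# Heuristic: sum the weights of the classes a load falls into and compare
# the result against the delinquency threshold. All numbers are illustrative.

WEIGHTS = {"H3:deref2": 0.7, "AG7:recurrence": 0.5,
           "H2:mul_or_shift": 0.4, "AG9:rarely_executed": -0.6}
THRESHOLD = 0.8   # placeholder delinquency threshold

def is_delinquent(load_classes, weights=WEIGHTS, threshold=THRESHOLD):
    score = sum(weights.get(c, 0.0) for c in load_classes)
    return 1 if score > threshold else 0

print(is_delinquent({"H3:deref2", "AG7:recurrence"}))             # 1.2 > 0.8 -> 1
print(is_delinquent({"H2:mul_or_shift", "AG9:rarely_executed"}))  # -0.2 -> 0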

Precision and Coverage. Precision of a heuristic scheme H: the fraction of loads that scheme H identifies as delinquent (the lower, i.e., the closer to the real set of delinquent loads, the better). Coverage of a heuristic scheme H: the fraction of cache misses caused by the loads identified as delinquent by scheme H (the closer to 100%, the better).
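A sketch of how both figures of merit could be computed from per-load miss counts reported by the cache simulator (the data layout is an assumption):

# precision: fraction of all static loads flagged as delinquent (lower is better)
# coverage: fraction of all cache misses caused by the flagged loads (higher is better)

def precision_and_coverage(misses_per_load, flagged, total_static_loads):
    precision = len(flagged) / total_static_loads
    total_misses = sum(misses_per_load.values())
    covered = sum(misses_per_load.get(l, 0) for l in flagged)
    coverage = covered / total_misses if total_misses else 0.0
    return precision, coverage

misses = {"ld1": 900, "ld2": 50, "ld3": 40, "ld4": 10}
print(precision_and_coverage(misses, flagged={"ld1", "ld2"}, total_static_loads=20))
# (0.1, 0.95): 10% of loads flagged, covering 95% of the misses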

Results on different inputs

Results when varying cache associativity

Results when varying cache size

Performance on new benchmarks

Performance summary

Performance of OKN & BDH

Performance with various parameter values

Combination with BB Profiling. Use the heuristic to sharpen the set returned by basic-block profiling, and also add loads that are not in the hotspots. The cutoff parameter is the percentage of the highest-scoring loads, detected by our method but not by profiling, that we consider to be delinquent.
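A hedged sketch of this combination (the scoring interface and the interpretation of "sharpen" as filtering by heuristic score are assumptions):

# Combine the static heuristic with basic-block profiling.
# cutoff: fraction of the highest-scoring loads found by the heuristic but
#         missed by profiling that are additionally treated as delinquent.

def combine_with_bb_profiling(heuristic_scores, profiled_hot_loads, cutoff):
    # Sharpen: keep profiled loads that the heuristic also scores highly.
    sharpened = {l for l in profiled_hot_loads if heuristic_scores.get(l, 0.0) > 0.0}
    # Add the top 'cutoff' fraction of loads that lie outside the hotspots.
    outside = sorted((l for l in heuristic_scores if l not in profiled_hot_loads),
                     key=lambda l: heuristic_scores[l], reverse=True)
    return sharpened | set(outside[:int(len(outside) * cutoff)])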

Combination with BB profiling

Conclusions. The static scheme for identifying delinquent loads has a precision of 10% and a coverage of over 90% across 18 benchmarks. It is more precise than related work, with similar coverage, and it is insensitive to variation of the framework parameters (e.g., cache size, associativity, input set).