Managing Multi-Configuration Hardware via Dynamic Working Set Analysis By Ashutosh S. Dhodapkar and James E. Smith Presented by Kyriakos Yioutanis

Introduction When a program executes, it passes through phases in which its performance characteristics and hardware resource requirements vary. If these phase changes can be detected, optimizations for performance and/or power can be applied and dynamic reconfiguration can be invoked. Configurable units typically have a fixed number of configurations, e.g. four different cache sizes. This paper proposes configuration management algorithms that can be applied to any multi-configuration unit.

Dynamically configurable hardware Examples include configurable caches and TLBs, allocation of memory hierarchy resources, allocation of memory buffer resources, configurable branch predictors, configurable instruction windows, and configurable pipelines. A combination of these methods can be used, but the complexity of the optimization problem increases, especially if the methods interact with one another.

Dynamic reconfiguration algorithm

Dynamic reconfiguration algorithm (2) Referred to as the Rochester algorithm; used to control a multi-configuration data cache. Interval: 100K instructions. Two states: Stable and Unstable. Tuning and reconfiguration take place in the Unstable state. The threshold used to detect phase changes is adjusted dynamically. Key metrics: detection efficiency (the ability to detect phase changes), reconfiguration overhead (depends on the amount of state involved; roughly 10 to 1000 cycles, e.g. for a data cache), and tuning overhead (the time to find the optimal configuration). Each tuning can lead to multiple reconfigurations. Why study working sets? Because phase changes are manifestations of working set changes.

Working with working sets The working set W(t_i, τ), for i = 1, 2, …, is the set of distinct segments {s_1, s_2, …, s_ω} touched over the i-th window (interval) of size τ. The working set (w.s.) size is ω, the cardinality of the set. Segments are memory regions of some fixed size (e.g. a page). The general model is the phase transition model: a phase is defined as the maximal interval over which the working set remains more or less constant, and the program follows a series of steady-state phases with abrupt transitions in between. The focus here is on the instruction working set (instruction and data working sets are distinguished). The window size is important for capturing a working set. In this paper, working sets contain cache-line-granularity elements, and the configurable units are caches and predictors. Non-overlapping windows are used.

Working with working sets (2) Main goals: identify the w.s., measure its size, and detect changes in the w.s. A measure is needed to compare two phases with working sets W(t_i, τ) and W(t_j, τ): the relative working set distance (see the formula sketched below). A large δ value indicates a w.s. change; a small value indicates no change. If δ = 0 the sets are identical; if δ = 1 the sets are totally different. A threshold δ_th is defined: there is a w.s. change if δ > δ_th.
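The distance formula on this slide is an image that does not survive in the transcript. The following LaTeX is a reconstruction consistent with the properties stated above (δ = 0 for identical sets, δ = 1 for disjoint sets); the symbols W(t_i, τ), W(t_j, τ), and δ are those defined on this slide.

```latex
\[
\delta\bigl(W(t_i,\tau),\,W(t_j,\tau)\bigr) \;=\;
\frac{\bigl|\,W(t_i,\tau)\cup W(t_j,\tau)\,\bigr| \;-\; \bigl|\,W(t_i,\tau)\cap W(t_j,\tau)\,\bigr|}
     {\bigl|\,W(t_i,\tau)\cup W(t_j,\tau)\,\bigr|}
\]
```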

Working set signature Complete working sets are difficult to manipulate directly, so a lossy-compressed w.s. representation is used: the working set signature (w.s.s.). The w.s.s. is an n-bit vector formed by mapping w.s. elements into n buckets with a random hash function (srand and rand). The low-order b address bits are ignored by the hash, since w.s. elements are at cache-line granularity. The bit vector itself is small (the evaluation uses 1024 bits, i.e. 128 bytes). The bit vector is cleared at the beginning of every interval. A sketch of signature construction follows.
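A minimal sketch of how such a signature might be collected, assuming a simple multiplicative hash; the paper uses a randomly generated hash function, so the hash constant, class name, and method names below are illustrative assumptions only.

```python
class WorkingSetSignature:
    """Lossy-compressed working set: an n-bit vector filled by hashing addresses."""

    def __init__(self, n_bits=1024, line_bits=6):
        self.n_bits = n_bits          # signature size (1024 bits = 128 bytes here)
        self.line_bits = line_bits    # low-order bits ignored (cache-line granularity)
        self.bits = [0] * n_bits

    def touch(self, address):
        """Record one reference: drop line-offset bits, hash into a bucket, set the bit."""
        element = address >> self.line_bits
        bucket = (element * 2654435761) % self.n_bits   # illustrative hash, not the paper's
        self.bits[bucket] = 1

    def fill_fraction(self):
        """Fraction of buckets set; used later to estimate the working set size."""
        return sum(self.bits) / self.n_bits

    def clear(self):
        """Cleared at the beginning of every interval."""
        self.bits = [0] * self.n_bits
```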

Working set signature (2)

Working set signature (3) The signature fill is closely related to the w.s. size: when k random keys are hashed into n buckets, the expected fraction of buckets filled, f, allows the w.s. size to be estimated from the signature. A 90%-filled table corresponds to a w.s. size about 2.5 times larger than the number of filled entries. The measure of similarity between two signatures S1 and S2 is the relative signature distance, and a threshold value Δ_th is used to detect phase changes. (Reconstructions of both relations are sketched below.)
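Both formulas on this slide are images in the original. The LaTeX below reconstructs them from the surrounding description: the fill-fraction relation follows from hashing k random keys into n buckets (and is presumably what this deck later calls "equation 2"), while the distance definition uses population counts of the bit vectors; treat both as reconstructions rather than verbatim equations from the paper.

```latex
% Expected fill fraction when k random keys are hashed into n buckets,
% and the resulting working set size estimate:
\[
f \;=\; 1 - \Bigl(1 - \tfrac{1}{n}\Bigr)^{k}
\qquad\Longrightarrow\qquad
k \;\approx\; \frac{\log(1-f)}{\log\!\bigl(1 - \tfrac{1}{n}\bigr)}
\]
% Relative signature distance between signatures S1 and S2,
% where ones(x) counts the set bits of x:
\[
\Delta \;=\; \frac{\mathrm{ones}(S_1 \oplus S_2)}{\mathrm{ones}(S_1 \vee S_2)}
\]
```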

3. Methodology A modified version of SimpleScalar and the SPEC2000 benchmarks, compiled with base-level optimization. Benchmarks were chosen to exhibit: 1) long- and short-term phases with differing performance, 2) recurring phases (to test w.s. identification), and 3) different working sets within a benchmark that lead to similar behavior for certain cache/predictor configurations and completely different behavior for others (to test reconfiguration). 100K instructions per interval, 20,000 intervals (2 billion instructions). The signature vector size is 1024 bits (128 bytes).

Signature accuracy Evaluate the accuracy of the w.s.s. distance by comparing it against the full w.s. distance. The relative distances are measured between pairs of consecutive windows.

Signature accuracy (2) The Rochester algorithm uses the dynamic count of conditional branches to detect w.s. changes, via a relative distance metric for conditional branch counts (a reconstruction is sketched below).
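The branch-based distance formula is also an image in the original transcript. A plausible reconstruction, assuming b_i and b_j denote the dynamic conditional branch counts in two intervals (an assumption; the paper's exact normalization may differ), is:

```latex
\[
\delta_{br} \;=\; \frac{\lvert\, b_i - b_j \,\rvert}{\max(b_i,\, b_j)}
\]
```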

Signature accuracy (3) There is some correlation but a high level of dispersion: several significant working set changes are associated with very small relative branch distances. Δ_th = 0.5 is chosen, which filters out most of the noise while still detecting significant phase changes. Phase change detection is relatively insensitive to Δ_th, because phase changes tend to be abrupt.

Evaluation: managing configurable hardware The signature-based algorithm.

Evaluation: managing configurable hardware (2) Three states: Stable – the program w.s. is stable and the configuration is optimal; Unstable – the w.s. is in transition; Tuning – the w.s. is stable and different configurations are being explored. Similar to the Rochester algorithm, except that the signature-based algorithm does not tune while the w.s. is in transition. The I-cache size is configured to 2KB, 8KB, 32KB or 128KB. Parameters for the Rochester algorithm: base_br_noise = 4500, br_dec = 50, br_inc = 1000, base_perf_noise = 450, perf_dec = 5, perf_inc = 100 and threshold = 2%. (A sketch of the state machine follows.)
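A minimal sketch of the signature-based control loop described on this slide. The relative_distance helper implements the ones(XOR)/ones(OR) distance reconstructed earlier; the class and method names, the exploration order, and the "highest performance wins" selection are illustrative assumptions, not the paper's code.

```python
def relative_distance(sig_a, sig_b):
    """Relative signature distance: ones(S1 XOR S2) / ones(S1 OR S2)."""
    xor = sum(a ^ b for a, b in zip(sig_a, sig_b))
    union = sum(a | b for a, b in zip(sig_a, sig_b))
    return xor / union if union else 0.0

STABLE, UNSTABLE, TUNING = "STABLE", "UNSTABLE", "TUNING"

class SignatureBasedTuner:
    """Interval-by-interval controller for a multi-configuration unit."""

    def __init__(self, configs=(2, 8, 32, 128), delta_th=0.5):
        self.configs = list(configs)      # e.g. I-cache sizes in KB
        self.delta_th = delta_th
        self.state = UNSTABLE
        self.prev_sig = None
        self.current = max(configs)       # start with the largest configuration
        self.to_try = []
        self.results = {}

    def end_of_interval(self, signature, performance):
        """Feed one interval's signature bit vector and measured performance;
        returns the configuration to use for the next interval."""
        changed = (self.prev_sig is None or
                   relative_distance(signature, self.prev_sig) > self.delta_th)
        self.prev_sig = signature

        if changed:                        # w.s. in transition: never tune here
            self.state = UNSTABLE
        elif self.state == UNSTABLE:       # w.s. just stabilized: start tuning
            self.state = TUNING
            self.to_try = list(self.configs)
            self.results = {}
        elif self.state == TUNING:         # record result of the config tried last interval
            self.results[self.current] = performance
            if not self.to_try:            # exploration done: keep the best configuration
                self.current = max(self.results, key=self.results.get)
                self.state = STABLE

        if self.state == TUNING and self.to_try:
            self.current = self.to_try.pop(0)
        return self.current
```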

Evaluation: managing configurable hardware (3)

Evaluation: managing configurable hardware (4)

Evaluation: managing configurable hardware (5)

Evaluation: managing configurable hardware (6)

Evaluation: managing configurable hardware (7) To reduce unnecessary tunings, the signature-based algorithm is extended to wait for 4 stable intervals before tuning. If the state remains UNSTABLE for more than 10 intervals and performance is below the threshold, the cache size is increased to the maximum. This acts as a backup strategy for cases where the working set never stabilizes and tuning is therefore never performed. (A sketch of this extension follows.)
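A sketch of how the two refinements on this slide could be layered onto the tuner above; the subclassing, the counter names, and the performance-threshold check are illustrative assumptions.

```python
class ExtendedTuner(SignatureBasedTuner):
    """Waits for 4 stable intervals before tuning, and falls back to the largest
    configuration if the working set stays unstable for more than 10 intervals
    while performance is poor."""

    def __init__(self, perf_threshold, **kwargs):
        super().__init__(**kwargs)
        self.perf_threshold = perf_threshold
        self.stable_count = 0
        self.unstable_count = 0

    def end_of_interval(self, signature, performance):
        changed = (self.prev_sig is None or
                   relative_distance(signature, self.prev_sig) > self.delta_th)
        if changed:
            self.stable_count, self.unstable_count = 0, self.unstable_count + 1
        else:
            self.stable_count, self.unstable_count = self.stable_count + 1, 0

        # Backup: the working set never stabilizes and performance is below threshold.
        if self.unstable_count > 10 and performance < self.perf_threshold:
            self.prev_sig = signature
            self.state = UNSTABLE
            self.current = max(self.configs)          # jump to the largest cache
            return self.current

        # Require 4 consecutive stable intervals before tuning is allowed to start.
        if not changed and self.state == UNSTABLE and self.stable_count < 4:
            self.prev_sig = signature
            return self.current

        return super().end_of_interval(signature, performance)
```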

4. Measuring working set sizes The signature size (the number of bits set) is closely related to the actual working set size. In cases where performance is directly related to the working set size (for example, instruction and data caches), the signature size can be used to determine the optimal configuration directly; there is no need for tuning.

Working set size experiments For small working sets the relationship is close to linear; as the w.s. grows, the relationship becomes non-linear. Even in the non-linear region, the signature can give reasonably accurate w.s. size estimates up to 3-4x the maximum signature size. So a typical signature (such as the 128-byte vector used here) with cache-line-size granularity can estimate w.s. sizes of many tens to hundreds of KB.

Evaluation: reconfiguration using signature size The extended signature-based algorithm can be modified to use the signature size for selecting an optimal cache configuration: the smallest cache that holds the current working set (plus 10% to allow for some noise). To determine the appropriate size, equation 2 (Section 2) is used. This eliminates the need to tune, and it typically reduces the number of reconfigurations as well. (A sketch of the size-based selection follows.)
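A minimal sketch of size-based selection, assuming the working set size is estimated from the signature fill fraction as in the reconstruction of equation 2 above and that each working set element corresponds to one cache line; the 64-byte line size and the function names are illustrative assumptions.

```python
import math

def estimate_ws_lines(fill_fraction, n_bits=1024):
    """Estimate the working set size (in cache lines) from the fraction of
    signature bits that are set, using k = log(1 - f) / log(1 - 1/n)."""
    if fill_fraction >= 1.0:
        return float("inf")               # saturated signature: size cannot be bounded
    return math.log(1.0 - fill_fraction) / math.log(1.0 - 1.0 / n_bits)

def pick_cache_size_kb(fill_fraction, sizes_kb=(2, 8, 32, 128),
                       line_bytes=64, slack=1.10):
    """Pick the smallest cache that holds the estimated working set plus 10% slack."""
    needed_kb = estimate_ws_lines(fill_fraction) * line_bytes * slack / 1024.0
    for size in sorted(sizes_kb):
        if size >= needed_kb:
            return size
    return max(sizes_kb)                  # working set exceeds the largest cache

# Example: a signature that is 60% full maps to the 128KB configuration here.
print(pick_cache_size_kb(0.60))
```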

Identifying recurring phases The same phases often recur multiple times during program execution. An algorithm is implemented that saves recurring-phase information to avoid re-tuning, by maintaining a phase table in memory. After tuning has determined the optimal configuration for a particular phase, it is stored in the table. Later, if the phase recurs, the optimal configuration can be reinstated without going through the tuning process.

Phase statistics

Evaluation: recurring working sets The algorithm for exploiting recurring working sets (using the phase table) is similar to the basic algorithm. On detecting a phase change, the algorithm first performs a table lookup to see if configuration information for the phase exists in the table. If so, the optimal configuration is reinstated. If not, the algorithm goes into the TUNING state, and at the end of tuning the optimal configuration is committed to the signature table. In addition to the configuration information, the table also tracks phase lengths. If, during its last execution, the phase length was fewer than four intervals (400,000 instructions), then tuning is not performed; this avoids tuning for insignificant phases. Four intervals are chosen because the tuning process takes a maximum of four intervals. (A sketch of the table follows.)
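A sketch of the recurring-phase handling described on this slide, layered on the tuner sketches above; the matching rule (nearest stored signature within the Δ threshold), the table structure, and the phase-length bookkeeping are illustrative assumptions.

```python
class PhaseTable:
    """Remembers the tuned configuration and the last observed length of each
    phase, keyed by a representative working set signature."""

    def __init__(self, delta_th=0.5, min_length_intervals=4):
        self.entries = []                     # dicts: signature, config, length
        self.delta_th = delta_th
        self.min_length = min_length_intervals

    def lookup(self, signature):
        """Return the stored entry whose signature is closest, within the threshold."""
        best, best_dist = None, self.delta_th
        for entry in self.entries:
            dist = relative_distance(signature, entry["signature"])
            if dist < best_dist:
                best, best_dist = entry, dist
        return best

    def on_phase_change(self, signature):
        """Called when a phase change is detected; returns a config to reinstate,
        'skip' if the phase was too short to be worth tuning, or None to tune."""
        entry = self.lookup(signature)
        if entry is None:
            return None                       # unseen phase: go to the TUNING state
        if entry["length"] < self.min_length:
            return "skip"                     # insignificant phase: do not re-tune
        return entry["config"]                # reinstate the previously tuned config

    def commit(self, signature, config, length):
        """Store the tuned configuration and the phase length at the end of tuning."""
        entry = self.lookup(signature)
        if entry is None:
            entry = {"signature": signature}
            self.entries.append(entry)
        entry["config"] = config
        entry["length"] = length
```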