T-SPaCS – A Two-Level Single-Pass Cache Simulation Methodology


T-SPaCS – A Two-Level Single-Pass Cache Simulation Methodology
Wei Zang and Ann Gordon-Ross+
University of Florida, Department of Electrical and Computer Engineering
+ Also affiliated with the NSF Center for High-Performance Reconfigurable Computing

Introduction
Power-hungry caches are a good candidate for optimization, and different applications have vastly different cache requirements
–Configurable cache parameters: size, line size, associativity
–Cache parameters that do not match an application's behavior can waste over 60% of cache energy (Gordon-Ross 05)
Cache tuning
–Determine appropriate cache parameter values (the cache configuration) to meet optimization goals (e.g., lowest energy)
–Difficult to determine the best configuration given the very large design spaces of highly configurable caches

Simulation-Based Cache Tuning
Cache tuning at design time via simulation
–Performed by the designer
–Typically iterative simulation using exhaustive or heuristic methods
An instruction set simulator runs the embedded application once per configuration: simulating with c1 yields the miss rate with c1, simulating with c2 yields the miss rate with c2, and so on for all n configurations c1, c2, c3, …, cn in the design space; the lowest-energy configuration (e.g., c3) is then selected.
…very time consuming (setup and simulation time)…

Single-Pass Cache Tuning
Simultaneously evaluate multiple cache configurations during one execution
–Trace-driven cache simulation: use a memory reference trace
–Generate the trace file through a single functional simulation
A single-pass trace-driven cache simulation of the embedded application produces the miss rates for c1, c2, …, cn at once, speeding up simulation time; the lowest-energy configuration (e.g., c3) is then selected.
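The single-pass idea can be sketched as follows: walk the trace exactly once while updating per-configuration state. This is a minimal illustration, not the paper's algorithm; the direct-mapped configurations, toy trace, and parameter encodings (block-offset bits, set-index bits) are assumptions for the example.

```python
# Sketch (not T-SPaCS itself): evaluating several direct-mapped cache
# configurations in a single pass over one memory reference trace.

def single_pass_miss_rates(trace, configs):
    """configs: dict name -> (block_bits, set_bits); returns miss rate per config."""
    tags = {name: {} for name in configs}    # one tag store per configuration
    misses = {name: 0 for name in configs}
    for addr in trace:                       # the trace is walked exactly once
        for name, (block_bits, set_bits) in configs.items():
            blk = addr >> block_bits
            idx = blk & ((1 << set_bits) - 1)
            tag = blk >> set_bits
            if tags[name].get(idx) != tag:   # miss: set holds a different block
                misses[name] += 1
                tags[name][idx] = tag
    return {name: misses[name] / len(trace) for name in configs}

# Hypothetical toy trace and two configurations c1, c2.
trace = [0x00, 0x04, 0x40, 0x00, 0x44, 0x04]
configs = {"c1": (2, 3), "c2": (4, 2)}
rates = single_pass_miss_rates(trace, configs)
```

All configurations see each reference as it streams by, so the trace file is read once instead of once per configuration.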

Previous Work in Single-Pass Simulation
Stack-based algorithms
–A stack data structure stores the access trace
–State-of-the-art: 14X speedup over iterative simulation (Viana 08)
Tree data structure-based algorithms
–Decreased simulation time
–Complex data structures, larger storage requirements
Limitation: prior work targets a single-level hierarchy (processor → L1 cache → main memory), while two-level hierarchies (processor → L1 cache → L2 cache → main memory) are becoming more popular.

Contributions
T-SPaCS: a Two-level Single-Pass trace-driven Cache Simulation methodology
–Uses a stack-based algorithm to simulate the level-one and level-two caches simultaneously
–Accurately determines the energy-optimal cache configuration with low storage and simulation time complexity

Single-level Cache Simulation
Stack-based single-pass trace-driven cache simulation for a single-level cache
Example configuration in the design space: block size = 4 (2^2) bytes, number of cache sets = 8 (2^3); each trace address is split into tag, index, and block offset.
Trace addresses with tags (001), (010), (101), (111), and (110) are processed in turn. For each processed address, the stack is searched; because the block has no previous access in the stack, the access is a compulsory miss, and the stack update pushes the block onto the stack.

Single-level Cache Simulation (continued)
When a processed address finds the same block earlier in the stack, the blocks accessed since that previous access that map to the same cache set as the processed address are the conflicts. In the example trace, the stack search finds one such conflict (conflicts # = 1), so the access is a hit for any cache associativity >= 2; the stack is then updated.

Two-Level Cache Simulation
Stack-based single-level cache simulation maintains one stack recording the L1 access trace. A naïve adaptation to two-level caches requires multiple stacks:
–Assumes an inclusive cache hierarchy
–L1 access trace: one stack based on the memory reference trace
–L2 access trace: depends on L1 misses, requiring n stacks for n L1 configurations
–Disadvantage: large storage space and lengthy simulation time
To reduce storage space and simulation time: use an exclusive cache hierarchy!

Inclusive vs. Exclusive Hierarchy
(Example traces on the slide compare an inclusive operation with L1/L2 LRU against an exclusive operation with L1 LRU and a FIFO-like L2.)
Key observations:
–L1 hits do not access L2
–L2 accesses are decided entirely by L1
In the exclusive hierarchy, the separate L1 and L2 behave like a single combined cache. Instead of one L2 stack per L1 configuration c1, c2, c3, …, cn, one stack suffices: simulate L1 and the combined cache, and derive the L2 cache behavior, reducing storage space and simulation time.
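The exclusive behavior the slide relies on can be illustrated with a toy model: a block lives in exactly one level, an L1 hit never touches L2, and an L1 victim falls into L2. This is a minimal fully-associative sketch under assumed sizes and a made-up trace, not the paper's simulated hierarchy.

```python
from collections import deque

class ExclusiveHierarchy:
    """Toy exclusive hierarchy: fully-associative L1 (LRU) and a
    FIFO-like L2 victim buffer. Sizes and trace are illustrative."""

    def __init__(self, l1_blocks, l2_blocks):
        self.l1, self.l1_blocks = deque(), l1_blocks
        self.l2, self.l2_blocks = deque(), l2_blocks

    def access(self, blk):
        if blk in self.l1:                     # L1 hit: L2 is never touched
            self.l1.remove(blk)
            self.l1.appendleft(blk)
            return "L1 hit"
        result = "L2 hit" if blk in self.l2 else "L1/L2 miss"
        if blk in self.l2:
            self.l2.remove(blk)                # exclusive: promoted block leaves L2
        self.l1.appendleft(blk)                # fill the block into L1
        if len(self.l1) > self.l1_blocks:
            self.l2.appendleft(self.l1.pop())  # L1 victim falls into L2
            if len(self.l2) > self.l2_blocks:
                self.l2.pop()                  # FIFO-like eviction from L2
        return result

cache = ExclusiveHierarchy(l1_blocks=2, l2_blocks=2)
results = [cache.access(b) for b in ["B", "A", "C", "A", "B"]]
```

After any access sequence, no block appears in both levels, which is exactly the property that lets the two caches be treated as one combined cache with a single stack.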

T-SPaCS Overview
Execute the application to produce the access trace file T[1], T[2], T[3], …, T[N]. For each trace address T[t], T-SPaCS performs:
–Stack processing for conflicts, for each block size B and set counts S1, S2
–L1 analysis based on the conflict count, for all L1 associativities W1, determining whether T[t] is an L1 hit or miss
–On an L1 miss, L2 analysis determines whether T[t] is an L2 hit or miss
–Stack update
The L1 and L2 misses are accumulated for all cache configurations in the design space.
(B: block size; S1, S2: number of sets in L1, L2; W1, W2: associativity of L1, L2)

L2 Analysis
Stack processing for the combined cache
–Conflict evaluation (same as for a single-level cache)
Compare-exclude operation to derive the L2 conflicts
–Conflicts for the combined cache still contain some blocks stored in L1
–Isolate the exclusive L2 conflicts
–Based on three different inclusion relationships between the L1 and L2 conflict sets, considered as three scenarios: Scenario 1: S1 = S2; Scenario 2: S1 < S2; Scenario 3: S1 > S2
(S1, S2: number of sets in L1, L2)
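The compare-exclude step above amounts to removing the L1-resident blocks from the combined-cache conflict list. A minimal sketch for the simplest case (S1 = S2), using the block names from the Scenario 1 slide; the helper name and the hit test against an assumed W2 = 2 are illustrative.

```python
# Compare-exclude (sketch, Scenario 1: S1 = S2): the exclusive L2
# conflicts are the combined-cache conflicts minus blocks resident in L1.

def l2_conflicts(combined_conflicts, l1_resident):
    """Keep stack order; drop conflicts still resident in L1."""
    resident = set(l1_resident)
    return [b for b in combined_conflicts if b not in resident]

# Scenario 1 example: combined-cache conflicts X4, X3, X2; with W1 = 2
# the two most recently accessed conflicting blocks occupy L1.
exclusive = l2_conflicts(["X4", "X3", "X2"], ["X4", "X3"])
l2_hit = len(exclusive) < 2   # L2 hit when the conflict count is < W2 (here W2 = 2)
```

One exclusive L2 conflict remains, so the access hits in L2 for any W2 >= 2, as on the slide.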

Scenario 1: S1 = S2
Example trace X1, X2, X3, X4, X1 with a 2-way L1 set and a 2-way L2 set. When X1 is accessed again, the stack yields conflicts X4, X3, X2: with W1 = 2, X1 misses in L1. Excluding the blocks currently in L1 leaves one exclusive L2 conflict (L2 conflicts # = 1), so X1 hits in L2 when W2 >= 2.
(S1, S2: number of sets in L1, L2)

Scenario 2: S1 < S2
Example trace X1, Y1, X2, X3, Y4, X1, where the X and Y blocks map to the same L1 set but different L2 sets, with a 2-way L1 set and 2-way L2 sets. When X1 is accessed again, the L1 conflicts are Y4, X3, X2, Y1: with W1 = 2, X1 misses in L1. The conflicts for the combined cache are X3 and X2; excluding the blocks currently in L1 leaves one exclusive L2 conflict (L2 conflicts # = 1), so X1 hits in L2 when W2 >= 2.
(S1, S2: number of sets in L1, L2)

Special Case in Scenario 2: Occupied Blanks
Example trace X4, Y1, X2, X3, X2, X1, Y2, X5 with a 2-way L1 set and a 4-way L2 set. Fetching X2 into L1 evicted Y1, which maps to a different L2 set, leaving an occupied blank (BLK) in the L2 set. When X5 is accessed, the actual cache misses in both L1 and L2, but the compare-exclude operation finds only 3 (< 4) L2 conflicts (X2, X1, X3, X4 for the combined cache, minus the blocks X2 and Y2 in L1) and incorrectly reports an L2 hit. This is inaccurate: the L2 conflicts should also count the BLK after X4.
Solution: occupied blank labeling
–A bit-array labels BLKs; a set bit means a BLK follows the labeled address
–In processing X2, label a BLK with the W2-th L2 conflict (X4)
–In processing X5, the BLK is detected in the bit-array of X4 (i.e., X4 is the last block in the L2 set), so X5 is correctly determined to be an L2 miss
(S1, S2: number of sets in L1, L2)

Scenario 3: S1 > S2
Example trace X1, Y1, X2, Y2, X3, Y3, X4, X1, where the X and Y blocks map to different L1 sets (Y's is the complementary set) but the same L2 set, with 2-way L1 sets and a 4-way L2 set. When X1 is accessed again, the L1 conflicts are X4, X3, X2 (L1 miss when W1 = 2), and the complementary L1 set contributes conflicts Y3, Y2, Y1. The conflicts for the combined cache are X4, Y3, X3, Y2, X2, Y1; excluding the blocks resident in both L1 sets leaves 2 exclusive L2 conflicts, so X1 hits in L2 when W2 >= 3.
(S1, S2: number of sets in L1, L2)

Accelerate Stack Processing
Stack processing is very time consuming: the conflicts for one L1 configuration are repeatedly compared with the conflicts for all L2 configurations. Solution: during conflict evaluation for each stack address (starting from the smallest set count S_min), save the conflicts in a tree structure for later reference, e.g., storing the conflicts that share a "10" set index in the corresponding tree node, before moving to the next stack address.
(S1, S2: number of sets in L1, L2)

Experiment Setup
Design space
–L1: cache size 2 KB – 8 KB; block size 16 B – 64 B; associativity direct-mapped to 4-way
–L2: cache size 16 KB – 64 KB; block size 16 B – 64 B; associativity direct-mapped to 4-way
–243 configurations (the exclusive cache hierarchy requires L1 and L2 to have the same block size)
24 benchmarks from EEMBC, Powerstone, and MediaBench
–Modified SimpleScalar's 'sim-fast' to generate access traces
–Modified SimpleScalar's 'sim-cache' to simulate an exclusive-hierarchy cache, producing the exact miss rates for comparison
–Built an energy model to determine the optimal cache configuration with minimum energy consumption (Gordon-Ross 09)
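The design-space size can be cross-checked: three choices each for L1 size, L2 size, shared block size (the exclusive hierarchy forces L1 and L2 to use the same block size), L1 associativity, and L2 associativity give 3^5 = 243 configurations. The parameter values below follow the ranges on this slide; treating each range as exactly three power-of-two points is an assumption consistent with the stated total.

```python
from itertools import product

# Parameter choices from the slide's design space.
l1_sizes = [2048, 4096, 8192]         # 2 KB - 8 KB
l2_sizes = [16384, 32768, 65536]      # 16 KB - 64 KB
block_sizes = [16, 32, 64]            # 16 B - 64 B, shared by L1 and L2
assocs = [1, 2, 4]                    # direct-mapped to 4-way

design_space = list(product(l1_sizes, l2_sizes, block_sizes, assocs, assocs))
```

Enumerating the cross product reproduces the 243 configurations evaluated in a single pass by T-SPaCS.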

Results – Miss Rate Accuracy
L1 miss rates: 100% accurate for all benchmarks.
L2 miss rates: accurate for 240 configurations (99% of the design space), across all benchmarks.
Inaccuracy comes from Scenario 3 (S1 > S2):
–Multiple L1 sets evict blocks into the same L2 set, and the eviction order is not consistent with the access order
–The introduced error is small: maximum average miss rate error 1.16%, maximum standard deviation 0.64%, maximum absolute miss rate error 1.55%
Tuning accuracy: the energy-optimal cache is accurately determined!

Simplified-T-SPaCS
Omits occupied blank labeling to reduce complexity and simulation time.
Tradeoff – additional miss rate error:
–L2 miss rate errors for an additional 228 configurations where S1 < S2 (95% of the design space), across all benchmarks
–Maximum average miss rate error 0.71%, maximum standard deviation 0.90%, maximum absolute miss rate error 3.35%
Tuning accuracy: the energy-optimal cache is still accurately determined!

Simulation Time Efficiency
T-SPaCS: maximum 18X, average 8X speedup over iterative simulation.
Simplified-T-SPaCS: maximum 24.7X, average 15.5X speedup.

Conclusions
T-SPaCS simulates a two-level instruction cache with an exclusive hierarchy in a single pass, reducing storage and time complexity:
–T-SPaCS is 8X faster than iterative simulation on average
–Simplified-T-SPaCS increases the average simulation speedup to 15X at the expense of inaccurate miss rates for 95% of the design space
–Both T-SPaCS and Simplified-T-SPaCS accurately determine the energy-optimal configuration
Our ongoing work extends T-SPaCS to simulate data and unified caches, and implements it in hardware for dynamic cache tuning.