Design Exploration of an Instruction-Based Shared Markov Table on CMPs
Karthik Ramachandran & Lixin Su


Design Exploration of an Instruction-Based Shared Markov Table on CMPs
Karthik Ramachandran & Lixin Su

Outline
- Motivation: multiple cores on a single chip; commercial workloads
- Our study: start from instruction sharing pattern analysis (our experiments), then move on to instruction cache miss pattern analysis (our experiments)
- Conclusions

Motivation
- Technology push: CMPs offer lower access latency to other processors
- Application pull: commercial workloads (OS behavior, database applications)
- Opportunities for shared structures: a Markov-based sharing structure can address large instruction footprints vs. small, fast I-caches

Instruction Sharing Analysis
How may instruction sharing occur?
- OS: multiple processes, scheduling
- DB: concurrent transactions, repeated queries, multiple threads
How can CMPs benefit from instruction sharing?
- Snoop/grab instructions from other cores
- Shared structures
Let's investigate.

Methodology
Two-step approach:
- Experiment I targets instruction trace analysis: how much sharing occurs?
- Experiment II targets I-cache miss stream analysis: examine the potential of a shared Markov structure

Experiment I
- Add instrumentation code to analyze committed instructions
- Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16 processors
- Histogram-based approach
How do we count? Example for the sequence {A,B}: P1: 3 times, P2: 1 time, P3: 0 times, P4: 2 times; Total: 10 times.
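The histogram-based counting can be sketched as follows. This is an illustrative sketch, not the authors' instrumentation code; the traces and function name are made up, and single letters stand in for instruction PCs.

```python
from collections import Counter

def count_sequences(traces, length):
    """Histogram of instruction sequences of the given length,
    summed over all processors' committed-instruction traces."""
    counts = Counter()
    for trace in traces:                      # one trace per processor
        for i in range(len(trace) - length + 1):
            counts[tuple(trace[i:i + length])] += 1
    return counts

# Toy committed-instruction traces for two processors
p1 = ["A", "B", "A", "B", "C", "A", "B"]
p2 = ["A", "B", "X", "Y"]
print(count_sequences([p1, p2], 2)[("A", "B")])   # {A,B} occurs 4 times in total
```

Sliding a window of each length (2 through 5) over every processor's trace and summing the per-sequence counts gives the repeat histogram the slide describes.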

Results - Experiment I
Q: Is there any instruction sharing?
A: Maybe; observe the number of times the sequences of length 2-5 repeat (~ ).
Q: But why do the numbers for a sequence pattern of 5 instructions not differ much from a sequence pattern of 2 instructions?
A: Spin loops! They account for 50% in the non-warm-up case and 30% in the warm-up case.

Experiment II
Focus on instruction cache misses:
- Is there sharing involved here too?
- What is the upper-bound performance benefit of a shared Markov table?
Experiment setup:
- 16K-entry fully associative shared Markov table
- Each entry holds two consecutive misses from the same processor
- Atomic lookup and hit/miss counter update when a processor has two consecutive I$ misses
- On a miss, insert a new entry at the LRU head
- On a hit, record the distance from the LRU head and move the hit entry to the LRU head
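A minimal simulation of this setup might look like the following. This is a sketch under the rules listed above, not the authors' simulator; the class and field names are assumptions.

```python
from collections import OrderedDict

class SharedMarkovTable:
    """Fully associative table keyed by (prev_miss, curr_miss) pairs,
    managed as a single LRU stack shared by all processors."""
    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        self.lru = OrderedDict()      # most-recently-used entry is last
        self.hits = 0
        self.misses = 0
        self.hit_distances = []       # distance from the LRU head on each hit
        self.last_miss = {}           # per-processor previous I$ miss address

    def record_miss(self, cpu, addr):
        prev = self.last_miss.get(cpu)
        self.last_miss[cpu] = addr
        if prev is None:              # need two consecutive misses to form a pair
            return
        key = (prev, addr)
        if key in self.lru:
            # Record distance from the LRU head (the MRU end of the stack).
            self.hit_distances.append(list(reversed(self.lru)).index(key))
            self.lru.move_to_end(key)          # move hit entry to the head
            self.hits += 1
        else:
            self.misses += 1
            self.lru[key] = True               # insert new entry at the head
            if len(self.lru) > self.capacity:
                self.lru.popitem(last=False)   # evict the true LRU entry
```

Feeding one processor the miss stream A, B, A, B produces the pairs (A,B), (B,A), (A,B): two table misses and then one hit on the repeated (A,B) pair.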

Design Block Diagram
Each processor's I$ sits in front of a small, fast shared Markov table and the L2 cache; the table issues a prefetch when an I$ miss occurs.

Table Lookup Hit Ratio
Q1: Is there a lot of miss sharing?
Q2: Does a constructive interference pattern exist to help a CMP?
Q3: Do equal opportunities exist for all the processors?

Let's Answer the Questions
A1: Yes, of course.
A2: Definitely; a constructive interference pattern exists, as the figure shows.
A3: Yes; the hit/miss ratio remains fairly stable across processors despite variance in the number of I-cache misses.

How Big Should the Table Be?
- About 60% of hits are within 4K entries of the LRU head.
- A shared Markov table can fairly utilize I-cache miss sharing.
- What about snooping and grabbing instructions from other I-caches?
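The sizing question reduces to a cumulative distribution over the recorded LRU-head distances: how big a table would have captured a given fraction of hits. A sketch of that tally, using hypothetical distance samples:

```python
def fraction_within(distances, limit):
    """Fraction of table hits whose distance from the LRU head is under `limit`."""
    if not distances:
        return 0.0
    return sum(d < limit for d in distances) / len(distances)

# Hypothetical hit-distance samples (in entries from the LRU head)
dists = [10, 500, 3000, 5000, 9000]
print(fraction_within(dists, 4096))   # 0.6: three of five hits within 4K entries
```

Sweeping `limit` over candidate table sizes yields the "60% within 4K entries" style of result quoted above.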

Real Design Issues
- Associativity and size of the table
- Choosing the right path when multiple paths exist
- Separating the address directory from the data entries, and having multiple address directories
- What if a sequential prefetcher exists?

Conclusions
- Instruction sharing on CMPs exists.
- Spin loops occur frequently in current workloads.
- A Markov-based structure for storing I-cache misses may be helpful on CMPs.

Questions?

Comparison with Real Markov Prefetching
- A real Markov prefetcher keeps per-address successor lists with counts (A -> B, C: cnt 5; A -> E: cnt 2; A -> D, F: cnt 3); on a miss to A, processor P prefetches along A, B & C.
- Our table instead holds miss pairs (AB, AC, AD, BD) on an LRU stack from head to tail, with a hit count of 2 and a miss count of 3; when P misses on A and then C, it looks up the pair in the table, updates the hit/miss counters, and changes/records the LRU position.

Lookup Example I
Before: the table holds AB, AC, AD, BD (LRU head to tail), with hit count 2 and miss count 3.
Processor P misses on A and then C and looks up the pair AC: a hit.
After: AC moves to the LRU head (AC, AB, AD, BD), hit count 3, miss count 3.

Lookup Example II
Before: the table holds AB, AC, AD, BD (LRU head to tail), with hit count 2 and miss count 3.
Processor P misses on C and then D and looks up the pair CD: a miss.
After: CD is inserted and BD is evicted from the LRU tail; hit count 2, miss count 4.
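Lookup Example II can be replayed with a small standalone sketch that follows the update rules stated in the experiment setup (insert at the LRU head on a miss, evict from the tail). The ordering and names are illustrative, not the authors' implementation.

```python
from collections import OrderedDict

# LRU stack, head (most recent) first: AB, AC, AD, BD; counters 2 hits / 3 misses
table = OrderedDict.fromkeys([("A", "B"), ("A", "C"), ("A", "D"), ("B", "D")])
hits, misses = 2, 3

key = ("C", "D")                        # processor P misses on C and then D
if key in table:
    hits += 1
    table.move_to_end(key, last=False)  # hit: move the entry to the LRU head
else:
    misses += 1
    table[key] = None
    table.move_to_end(key, last=False)  # miss: insert the new entry at the head...
    table.popitem()                     # ...and evict from the LRU tail

print(list(table.keys()), hits, misses)
```

Running it leaves CD at the head, BD evicted, and the counters at 2 hits / 4 misses, matching the example.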