“ NAHALAL : Cache Organization for Chip Multiprocessors ” New LSU Policy By : Ido Shayevitz and Yoav Shargil Supervisor: Zvika Guz.


NAHALAL ARCHITECTURE The NAHALAL architecture defines the memory cache banks of the L2 cache. Each processor has a private backyard bank, and all processors share a small central bank. The architecture is based on the hot shared lines phenomenon: a small set of cache lines is accessed by many processors, so placing those lines in the middle keeps them close to every CPU.
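As a rough illustration of the placement idea above (the function name, bank IDs, and CPU count are hypothetical, not from the slides), the decision can be sketched as:

```c
#include <stdbool.h>

#define NUM_CPUS 8
#define SHARED_BANK NUM_CPUS   /* hypothetical ID for the small middle bank */

/* Hypothetical placement decision: "hot" shared lines go to the small
 * middle bank that all CPUs surround; all other lines go to the
 * requesting CPU's private backyard bank. */
int nahalal_place(int requesting_cpu, bool line_is_shared) {
    if (line_is_shared)
        return SHARED_BANK;    /* middle bank, close to every CPU */
    return requesting_cpu;     /* backyard bank of the requester */
}
```

This is only the placement half; the replacement half (which line to evict from each bank) is what the LSU policy below addresses.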

LSU Improvement  Placement Policy  Replacement policy for the private banks: LRU  Replacement policy for the public bank: NAHALAL LSU instead of LRU The LSU policy selects the Least Shared Used line to evict from the public bank.

LSU Implementation  A shift-register with N cells for each line.  Each cell in the shift-register holds a CPU number.  On eviction by CPUi: for each shift-register, XOR each cell with the ID of CPUi. A shift-register in which every XOR produces 0 marks the chosen line. If none produces 0, fall back to regular LRU.  To reduce memory overhead, define N=4; this yields 18.75% memory overhead.  A simple, fast algorithm in hardware.
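A software sketch of this hardware mechanism may clarify it (the line count and data types are hypothetical, and real hardware would evaluate all the XORs in parallel rather than in a loop):

```c
#include <stdint.h>

#define N 4            /* shift-register depth, per the slides */
#define NUM_LINES 16   /* hypothetical public-bank size for the sketch */

/* Each public-bank line keeps the IDs of the last N CPUs that touched it. */
typedef struct { uint8_t cell[N]; } shift_reg_t;

/* Record an access: shift the accessing CPU's ID into the register. */
void lsu_touch(shift_reg_t *r, uint8_t cpu) {
    for (int i = N - 1; i > 0; i--)
        r->cell[i] = r->cell[i - 1];
    r->cell[0] = cpu;
}

/* On an eviction triggered by `cpu`: choose a line whose every cell XORs
 * to 0 with `cpu`, i.e. a line recently used only by that CPU and hence
 * the Least Shared Used. Return -1 if no such line, meaning the cache
 * falls back to regular LRU. */
int lsu_victim(const shift_reg_t regs[], int nlines, uint8_t cpu) {
    for (int line = 0; line < nlines; line++) {
        int all_zero = 1;
        for (int i = 0; i < N; i++) {
            if (regs[line].cell[i] ^ cpu) { all_zero = 0; break; }
        }
        if (all_zero)
            return line;   /* used only by `cpu`: least shared */
    }
    return -1;             /* no candidate: fall back to LRU */
}
```

The design choice here is that a line whose entire access history is one CPU contributes nothing to sharing, so it is the cheapest line to push out of the shared middle bank.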

Simulation Structure in Simics Using a Python script, we defined the simulated machine configuration:

Writing Benchmarks Writing benchmarks is done in the simulated target's console:

Writing Benchmarks  Threads are created with the pthread library.  Each thread is bound to a CPU using the sched library.  Parallel code is written in the benchmark.  OS code and pthread library code also run in parallel.  Each benchmark is run twice: first without LSU, then with LSU.
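The slides don't include the benchmark source; a minimal sketch of the pattern they describe (pthreads plus Linux CPU binding via the non-portable `pthread_setaffinity_np`; the shared counter, thread count, and iteration count are hypothetical) might look like:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NUM_THREADS 4
#define ITERS 100000

/* A shared counter: accesses to it generate the "hot shared line"
 * traffic that the LSU policy is meant to keep in the middle bank. */
long shared_counter;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread binds itself to one CPU, then hammers the shared line. */
void *worker(void *arg) {
    long cpu = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)cpu, &set);
    /* Binding may fail on hosts with fewer CPUs; the thread still runs. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Spawn one pinned thread per CPU and wait for them all. */
long run_benchmark(void) {
    pthread_t tid[NUM_THREADS];
    shared_counter = 0;
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    return shared_counter;
}
```

The same binary is then run in the simulated target twice, once with each replacement policy, so the cache statistics are directly comparable.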

Collecting Statistics Cache statistics: l2c
 Total number of transactions / Total memory stall time / Total memory hit stall time
 Device data reads (DMA): 0
 Device data writes (DMA): 0
 Uncacheable data reads: 17
 Uncacheable data writes / Uncacheable instruction fetches: 0
 Data read transactions / Total read stall time / Total read hit stall time
 Data read remote hits: 0
 Data read misses / Data read hit ratio: 97.43%
 Instruction fetch transactions: 0
 Instruction fetch misses: 0
 Data write transactions / Total write stall time / Total write hit stall time
 Data write remote hits: 0
 Data write misses: 0
 Data write hit ratio: 100.00%
 Copy back transactions: 0
 Number of replacements in the middle (NAHALAL): 557

Results 1. Improvement of 54% in average stall time per transaction. 2. Improvement of 61% in average stall time per transaction. 3. Without LSU, 8.37% of the transactions caused a replacement in the middle; with LSU, only 0.09%! An improvement of ∆=8.28%. 4. Without LSU, 8.75% of the transactions caused a replacement in the middle; with LSU, only 0.02%! An improvement of ∆=8.73%.

Conclusions  The LSU policy significantly improves average stall time per transaction; therefore, the LSU policy implemented in the NAHALAL architecture significantly reduces the number of cycles for a benchmark.  The LSU policy significantly reduces the number of replacements in the middle; therefore, the LSU policy implemented in the NAHALAL architecture better keeps the hot shared lines in the public bank.  In our implementation, LRU is activated if LSU did not find a line; therefore, the LSU policy as we implemented it is always preferable to LRU.