
Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring
Lei Jin and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh
CMPMSI’07, February 11, 2007

Multicore distributed L2 caches

- L2 caches are typically sub-banked and distributed:
  - IBM Power4/5: 3 banks
  - Sun Microsystems T1: 4 banks
  - Intel Itanium2 (L3): many "sub-arrays"
- Distributed L2 caches + switched NoC → NUCA
- Hardware-based management schemes:
  - Private caching
  - Shared caching
  - Hybrid caching

[Figure: a tile containing a processor core, a router, and a local L2 cache slice]

Private and shared caching

Private caching:
+ short hit latency (always local)
− high on-chip miss rate
− long miss resolution time
− complex coherence enforcement

Shared caching:
+ low on-chip miss rate
+ straightforward data location
+ simple coherence (no replication)
− long average hit latency

Other approaches

- Hybrid/flexible schemes:
  - "Core clustering" [Speight et al., ISCA 2005]
  - "Flexible CMP cache sharing" [Huh et al., ICS 2004]
  - "Flexible bank mapping" [Liu et al., HPCA 2004]
- Improving shared caching:
  - "Victim replication" [Zhang and Asanovic, ISCA 2005]
- Improving private caching:
  - "Cooperative caching" [Chang and Sohi, ISCA 2006]
  - "CMP-NuRAPID" [Chishti et al., ISCA 2005]

Motivation

[Figure: a spectrum between miss rate and hit latency]

What is the optimal balance between miss rate and hit latency?

Talk roadmap

- Data mapping, a key property [Cho and Jin, MICRO 2006]
- Two-dimensional (2D) page coloring algorithm
- Evaluation and results
- Conclusion and future work

Data mapping

- Data mapping: memory data → location in the L2 cache
- Private caching:
  - Mapping determined by the location of the accessing program (core)
  - Mapping created at miss time
  - No explicit control
- Shared caching:
  - Mapping determined by address: slice number = (block address) mod N_slice
  - Mapping is static
  - No explicit control

Change mapping granularity

- Block granularity: slice number = (block address) mod N_slice
- Page granularity: slice number = (page address) mod N_slice
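To make the granularity difference concrete, here is a minimal Python sketch; the block size, page size, and slice count are illustrative assumptions rather than values fixed by this slide:

```python
N_SLICE = 16      # assumed number of L2 slices (one per tile)
BLOCK_BITS = 6    # assumed 64 B cache blocks
PAGE_BITS = 12    # assumed 4 KB pages

def slice_block_granularity(addr: int) -> int:
    """Block granularity: consecutive blocks interleave across all slices."""
    return (addr >> BLOCK_BITS) % N_SLICE

def slice_page_granularity(addr: int) -> int:
    """Page granularity: every block of a page lands in the same slice."""
    return (addr >> PAGE_BITS) % N_SLICE
```

At page granularity the slice is a function of the physical page number alone, which is exactly the knob the OS turns when it allocates pages, as the next slide shows.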

OS-controlled page mapping

[Figure: the OS page allocator maps virtual pages of Program 1 and Program 2 (virtual address space) onto physical memory pages (physical address space), and thereby onto L2 slices]
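How the OS exercises this control can be sketched as color-aware page allocation; the free-list structure below is hypothetical scaffolding for illustration, not the paper's kernel mechanism:

```python
N_SLICE = 16  # as in the earlier sketch; color = physical page number mod N_SLICE

# One free list of physical page numbers per color (hypothetical).
free_pages = {c: [] for c in range(N_SLICE)}

def alloc_page(preferred_color: int) -> int:
    """Return a free physical page of the preferred color, else any color."""
    if free_pages[preferred_color]:
        return free_pages[preferred_color].pop()
    for pages in free_pages.values():  # fall back to any non-empty list
        if pages:
            return pages.pop()
    raise MemoryError("out of physical pages")
```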

2D page coloring: the problem

- Network latency per hop = 3 cycles
- Memory latency = 300 cycles
- Cost(color) = (#accesses × #hops × 3 cycles) + (#misses × 300 cycles)

[Figure: per-page access and miss counts translate into a placement cost for each color]
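As a quick sanity check of this cost model, a sketch using the slide's constants; the access, hop, and miss counts in the example call are made-up numbers:

```python
HOP_DELAY = 3      # network latency per hop (cycles), from the slide
MEM_LATENCY = 300  # memory latency (cycles), from the slide

def page_cost(n_access: int, n_hops: int, n_miss: int) -> int:
    """Cost of placing one page at a color whose slice is n_hops away."""
    return n_access * n_hops * HOP_DELAY + n_miss * MEM_LATENCY

# Hypothetical page: 100 accesses from a core 2 hops away, 5 misses:
# 100 * 2 * 3 + 5 * 300 = 600 + 1500 = 2100 cycles.
assert page_cost(100, 2, 5) == 2100
```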

2D coloring algorithm

- Collect an L2 reference trace
- Derive conflict information [Sherwood et al., ICS 1999]

The slides step through a small example with the reference trace A, B, B, C. A reference matrix R tracks, for each page, which other pages have been referenced since its own last reference; a conflict matrix accumulates these into conflict counts, and an access counter tallies references per page. The update rule (reconstructed from the animation frames) on each reference to page P:

1. For every page Q with R[P][Q] = 1, increment Conflict[P][Q]
2. Clear row P of R
3. Set R[X][P] = 1 for every other page X
4. Increment P's access counter

After processing the trace A, B, B, C:

Reference matrix R      Conflict matrix      Access counters
    A  B  C                 A  B  C
 A  0  1  1              A  0  0  0           A: 1
 B  0  0  1              B  1  0  0           B: 2
 C  0  0  0              C  1  1  0           C: 1
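The bookkeeping above fits in a few lines. This is a sketch of the profiling pass as reconstructed from the slides (not the authors' implementation), replayed on the slides' A, B, B, C example:

```python
from collections import defaultdict

def profile(trace):
    """Build the conflict matrix and access counters from a page trace."""
    pages = sorted(set(trace))
    ref = {p: defaultdict(int) for p in pages}       # reference matrix R
    conflict = {p: defaultdict(int) for p in pages}  # conflict matrix
    access = defaultdict(int)                        # access counters
    for p in trace:
        for q in pages:
            if ref[p][q]:            # q intervened since p's last reference
                conflict[p][q] += 1
        ref[p].clear()               # clear row p
        for x in pages:
            if x != p:
                ref[x][p] = 1        # mark p in every other page's row
        access[p] += 1
    return conflict, access

conflict, access = profile(["A", "B", "B", "C"])
assert conflict["B"]["A"] == 1 and conflict["C"]["A"] == 1 \
       and conflict["C"]["B"] == 1
assert access == {"A": 1, "B": 2, "C": 1}
```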

2D coloring algorithm (cont'd)

2D page coloring combines the conflict matrix and access counters into a per-page, per-color cost:

Cost(color, page) = α × #Conflict(color) × mem latency + (1 − α) × #Access × #hop(color) × hop delay

Optimal color(page) = { C | Cost(C, page) = MIN[Cost(color, page)] over all colors }
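A sketch of the per-page color choice that evaluates this formula; the per-color conflict counts and hop distances are assumed inputs here (in the talk they would come from the profiled conflict matrix and the tile topology):

```python
def best_color(n_conflict, n_access, hops, alpha,
               mem_latency=300, hop_delay=3):
    """Pick the color minimizing the weighted cost from the slide.

    n_conflict[c]: expected conflicts if the page is given color c
    n_access:      total accesses to the page
    hops[c]:       network hops from the accessing core to slice c
    """
    def cost(c):
        return (alpha * n_conflict[c] * mem_latency
                + (1 - alpha) * n_access * hops[c] * hop_delay)
    return min(range(len(hops)), key=cost)
```

α thus trades miss cost against hit latency: a larger α favors colors with fewer conflicts (lower miss rate), while a smaller α favors colors fewer hops away (lower hit latency).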

Experimental setup

- Experiments were carried out using a simulator derived from the SimpleScalar toolset.
- The simulator models a 16-core tile-based CMP.
- Each core has private 32KB I/D L1 caches and a 256KB slice of the globally shared L2 (4MB total).

[Figure: tool flow: profiling produces a trace; 2D coloring with α tuning produces a page mapping; both feed the timing simulation]

Optimal page mapping

[Figure: optimal page mapping for gcc over the tile grid (x, y), showing the number of pages placed on each tile, for α = 1/64 and α = 1/256]

Access distribution

[Figure: L2 access distribution across tiles as α varies from 1/32 to 1/2048]

Relative performance

[Figure: relative performance results]

Value of α

[Figure: sensitivity of performance to the value of α]

Conclusions

- With careful data placement, there is large room for performance improvement.
- Dynamic mapping schemes, assisted by hardware-collected information, could achieve similar performance improvements.
- The method can also be applied to other optimization targets.

Current and future work

- Dynamic mapping schemes:
  - Performance
  - Power
- Multiprogrammed and parallel workloads

Thank you & Questions?

Backup: private caching

Handling an L1 miss (local L2 access):
1. L1 miss
2. Local L2 access: hit → done; miss → step 3
3. Access directory: a copy on chip → fetch from the remote L2; no copy → global miss

+ short hit latency (always local)
− high on-chip miss rate
− long miss resolution time
− complex coherence enforcement

Backup: shared caching

Handling an L1 miss:
1. L1 miss
2. L2 access: hit → done; miss → global miss

+ low on-chip miss rate
+ straightforward data location
+ simple coherence (no replication)
− long average hit latency

Performance

[Figure: performance improvement over shared caching, with highlighted gains of 141% and 150%]