
Implementing a Hybrid SRAM / eDRAM NUCA Architecture
Javier Lira (UPC, Spain), Carlos Molina (URV, Spain), David Brooks (Harvard, USA), Antonio González (Intel-UPC, Spain)
HiPC 2011, Bangalore (India) – December 21, 2011

CMPs incorporate large LLCs, which occupy a significant percentage of the chip area. POWER7 implements its L3 cache with eDRAM, which provides 3x the density and 3.5x lower energy consumption than SRAM, but increases latency by a few cycles. We propose a placement policy to accommodate both technologies in a NUCA cache.

NUCA divides a large cache into smaller and faster banks. The cache access latency consists of the routing latency plus the bank access latency, so banks close to the cache controller have lower latencies than banks farther away. (Figure: processor connected to an array of NUCA banks at increasing distances from the controller.)
[1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. ASPLOS'02

SRAM provides high performance; eDRAM provides low power and high density.

                  SRAM    eDRAM
Latency           x       1.5x
Density           x       3x
Leakage           2x      x
Dynamic energy    1.5x    x
Needs refresh?    No      Yes
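As a toy illustration of what these factors imply for a hybrid design, the sketch below compares the leakage of a half-SRAM / half-eDRAM cache against an all-SRAM cache of the same capacity, using only the relative units in the table. Refresh energy is deliberately ignored, so this is an illustration, not a result from the talk.

```python
# Toy example (not a result from the talk): relative leakage of a
# half-SRAM / half-eDRAM cache vs. an all-SRAM cache of the same
# capacity, using only the relative factors in the table above.
# eDRAM refresh energy is deliberately ignored here.
SRAM_LEAKAGE, EDRAM_LEAKAGE = 2.0, 1.0    # relative units from the table

all_sram_leakage = SRAM_LEAKAGE
hybrid_leakage = 0.5 * SRAM_LEAKAGE + 0.5 * EDRAM_LEAKAGE

print(hybrid_leakage / all_sram_leakage)  # 0.75 -> 25% lower leakage
```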

Outline:
- Introduction
- Methodology
- Implementing a hybrid NUCA cache
- Analysis of our design
- Exploiting architectural benefits
- Conclusions

The baseline NUCA design follows Beckmann and Wood [2]. The NUCA storage is shared by 8 cores, and its behavior is defined by four policies:
- Placement: 16 possible bank positions per data block.
- Access: partitioned multicast.
- Migration: gradual promotion.
- Replacement: LRU + zero-copy.
(Figure: 8-core CMP with the shared NUCA banks between the cores, showing the candidate positions for Core 0's data.)
[2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO'04
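To make the interplay of these four policies concrete, here is a minimal Python sketch of one D-NUCA "set" (the 16 candidate bank positions for a block, ordered from closest to farthest from the core). The grouping, data structures, and names are illustrative assumptions, not the authors' simulator code.

```python
# Hypothetical sketch of the four baseline D-NUCA policies listed above.
class DNucaSet:
    def __init__(self, n_positions=16, group_size=8):
        self.lines = [None] * n_positions   # Placement: 16 positions per block
        self.group_size = group_size

    def access(self, tag):
        # Access: partitioned multicast -- probe the closest group of
        # banks first, and only on a miss probe the next group.
        for start in range(0, len(self.lines), self.group_size):
            for pos in range(start, start + self.group_size):
                if self.lines[pos] == tag:
                    self._promote(pos)      # Migration on every hit
                    return True
        self._place(tag)                    # Placement on a miss
        return False

    def _promote(self, pos):
        # Migration: gradual promotion -- swap the hit block with the
        # one sitting one position closer to the core.
        if pos > 0:
            self.lines[pos - 1], self.lines[pos] = (self.lines[pos],
                                                    self.lines[pos - 1])

    def _place(self, tag):
        # Replacement: LRU + zero-copy -- the victim in the farthest
        # position simply leaves the NUCA instead of being relocated.
        self.lines[-1] = tag
```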

Number of cores        8 (UltraSPARC IIIi)
Frequency              1.5 GHz
Main memory size       4 GBytes
Memory bandwidth       512 Bytes/cycle
Private L1 caches      8 x 32 KBytes, 2-way
Shared L2 NUCA cache   8 MBytes, 128 banks
NUCA bank              64 KBytes, 8-way
L1 cache latency       3 cycles
NUCA bank latency      4 cycles
Router delay           1 cycle
On-chip wire delay     1 cycle
Main memory latency    250 cycles (from core)

Simulation infrastructure: Simics (running Solaris 10) with GEMS (Ruby, Garnet, Orion), executing the PARSEC and SPEC CPU benchmark suites on 8 UltraSPARC IIIi cores.
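As a small illustration of how the table's latency parameters compose into the non-uniform access time described earlier (routing latency plus bank access latency), consider the sketch below; the hop counts are assumptions chosen for the example, not figures from the paper.

```python
# Illustrative composition of the latency parameters from the table.
ROUTER_DELAY = 1   # cycles, from the table
WIRE_DELAY = 1     # cycles per on-chip link, from the table
BANK_LATENCY = 4   # cycles, from the table

def nuca_access_latency(hops):
    # Total latency = routing latency + bank access latency.
    return hops * (ROUTER_DELAY + WIRE_DELAY) + BANK_LATENCY

print(nuca_access_latency(1))   # a nearby bank:   6 cycles
print(nuca_access_latency(8))   # a distant bank: 20 cycles
```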

Outline (next section: Implementing a hybrid NUCA cache)

Fast SRAM banks are located close to the cores, while slower eDRAM banks occupy the center of the NUCA cache. PROBLEM: migration tends to concentrate shared data in the central banks. (Figure: 8-core CMP with SRAM banks near the cores and eDRAM banks in the center.)

A significant amount of the data in the LLC is never accessed during its lifetime. SRAM banks store the most frequently accessed data, while eDRAM banks hold data blocks that either:
- just arrived in the NUCA cache, or
- were evicted from SRAM banks.

An incoming block first goes to an eDRAM bank; if it is accessed there, it moves to an SRAM bank. Features:
- Migration happens only between SRAM banks.
- eDRAM banks do not communicate among themselves.
- Blocks are not evicted from SRAM banks out of the cache: eDRAM acts as extra storage for SRAM.
PROBLEM: the access scheme must search twice as many banks. (Figure: 8-core CMP with eDRAM banks backing the SRAM banks.)
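The placement policy just described can be sketched as follows; the Bank class, its LRU handling, and the function names are hypothetical illustrations of the described behavior, not the actual implementation.

```python
from collections import OrderedDict

class Bank:
    """Illustrative fully-associative bank with LRU replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()                # tag -> data, LRU order

    def insert(self, tag, data=None):
        """Insert a block; return the evicted (LRU) block, if any."""
        victim = None
        if tag not in self.lines and len(self.lines) >= self.capacity:
            victim = self.lines.popitem(last=False)
        self.lines[tag] = data
        self.lines.move_to_end(tag)
        return victim

def hybrid_access(tag, sram, edram):
    if tag in sram.lines:                  # SRAM hit: migration between
        sram.lines.move_to_end(tag)        # SRAM banks applies (not shown)
        return "sram-hit"
    if tag in edram.lines:                 # eDRAM hit: promote to SRAM
        data = edram.lines.pop(tag)
        demoted = sram.insert(tag, data)
        if demoted is not None:            # the SRAM victim is demoted to
            edram.insert(*demoted)         # eDRAM, never evicted outright
        return "edram-hit"
    edram.insert(tag)                      # miss: new blocks enter eDRAM
    return "miss"

sram, edram = Bank(capacity=2), Bank(capacity=4)
hybrid_access("A", sram, edram)            # miss: "A" is placed in eDRAM
hybrid_access("A", sram, edram)            # eDRAM hit: "A" moves to SRAM
```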

A Tag Directory Array (TDA) stores the tags of the eDRAM banks. Using the TDA, the access scheme looks up at most 17 banks. The TDA requires 512 KBytes for an 8-MByte (4S-4D) hybrid NUCA cache.
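A back-of-the-envelope check of the quoted TDA size, assuming 64-byte cache lines (the line size is an assumption, not stated on this slide), suggests about 8 bytes per eDRAM tag entry:

```python
# Sanity check of the quoted TDA size under an assumed line size.
edram_bytes = 4 * 1024 * 1024          # 4S-4D: 4 MBytes of eDRAM to track
line_bytes = 64                        # assumed cache-line size
tda_bytes = 512 * 1024                 # 512 KBytes, as quoted above

edram_blocks = edram_bytes // line_bytes      # 65,536 eDRAM tags
print(tda_bytes / edram_blocks)               # 8.0 bytes per TDA entry
```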

Heterogeneous + TDA outperforms the other hybrid alternatives, so we use Heterogeneous + TDA as the hybrid NUCA cache in the remaining analysis.

Outline (next section: Analysis of our design)

Well-balanced configurations achieve performance similar to an all-SRAM NUCA cache, and the majority of hits occur in SRAM banks.

The hybrid NUCA pays the overhead of the TDA. The less SRAM the hybrid NUCA uses, the better it compares.

The 4S-4D configuration achieves performance similar to the all-SRAM cache, reduces power consumption by 10%, and occupies 15% less area.

Outline (next section: Exploiting architectural benefits)

Starting from the 4S-4D configuration (SRAM: 4 MBytes, eDRAM: 4 MBytes), which takes 15% less area than all SRAM banks, the saved area can be reinvested as extra capacity: +1 MByte in SRAM banks (5S-4D) or +2 MBytes in eDRAM banks (4S-6D). (Figure: floorplans of the 5S-4D and 4S-6D configurations.)

Both configurations increase performance by 4% and do not increase power consumption.

Outline (next section: Conclusions)

IBM integrates eDRAM in its latest general-purpose processor. We implemented a hybrid NUCA cache that effectively combines SRAM and eDRAM technologies. Our placement policy succeeds in concentrating most accesses in the SRAM banks. A well-balanced hybrid cache achieves performance similar to the all-SRAM configuration, but occupies 15% less area and dissipates 10% less power. By exploiting these architectural benefits, we achieve performance improvements of up to 10%, and of 4% on average.

Questions? HiPC 2011, Bangalore (India) – December 21, 2011