Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors. Ali Shafiee, Narges Shahidi, Amirali Baniasadi. Sharif University of Technology.

Presentation transcript:

1 Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi
Sharif University of Technology, University of Victoria

2 This Work: Improving Snoop Coherency
Goal: Improving energy efficiency in snoop-based CMPs.
Motivation: Broadcasting and processing the entire tag is inefficient.
Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.
Key Results: Performance (2.9%), Tag array power (52%), Bandwidth utilization (78.5%).

3 Our Solution (PTC) vs. Conventional
[Figure: D$ caches connected through the interconnect to the upper-level cache, shown for the conventional design and for our solution.]
Conventional: Fast (+), Power & Bandwidth.
Our solution: Fast (++, early miss detection), Power & Bandwidth Efficient (+).

4 Conventional Snooping
[Figure: CPUs with their D$ and a controller, connected by the address bus, snoop bus, and command bus.]
Redundant (miss): ~70%

5 Snoop Filters
Goal: Eliminate redundant snoop requests.
Examples: RegionScout (ISCA05), CGCT (ISCA05), SSP (ASPLOS08).
PTC: (1) Early miss detection using a subset of the tag bits. (2) Once a miss is detected, the snoop is avoided.
How often is that possible?
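As a rough illustration of early miss detection, the Python sketch below compares only the low-order bits of a tag. The function names and the choice of the 8 least-significant bits are assumptions made for this example, not the authors' implementation. The property it relies on is that a partial mismatch proves a miss, while a partial match may still turn out to be a full-tag miss, so it only means the snoop has to proceed.

```python
from typing import Sequence

PARTIAL_BITS = 8  # assumed width of the compared tag subset


def partial_tag(tag: int, bits: int = PARTIAL_BITS) -> int:
    """Return the `bits` least-significant bits of a tag."""
    return tag & ((1 << bits) - 1)


def may_hit(request_tag: int, cached_tags: Sequence[int],
            bits: int = PARTIAL_BITS) -> bool:
    """False is a guaranteed miss; True means a full comparison is still needed."""
    want = partial_tag(request_tag, bits)
    return any(partial_tag(t, bits) == want for t in cached_tags)


def full_hit(request_tag: int, cached_tags: Sequence[int]) -> bool:
    """Conventional full-tag comparison, shown for contrast."""
    return request_tag in cached_tags
```

With these definitions, `may_hit(0x1234, [0xAB35])` is false, so the snoop could be skipped, whereas `may_hit(0x1234, [0xAB34])` is true even though `full_hit` would report a miss.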

6 How often are n bits enough to detect a miss?
More than 95% of misses can be detected using 8 bits.

7 PTC-Filter
[Figure: the LSBs of the tag on the address bus are looked up in the PTC-Filter of the D$. Miss: avoid the snoop and access the upper level. Hit: snoop the potential targets.]

8 PTC-Filter for a 4-way D$
[Figure: the filter holds Core 1's LSB, Core 2's LSB, Core 3's LSB, ... Each field consists of V, D, and LSB (8 bits).]
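The filter layout on this slide can be sketched as a small Python data structure. This is a minimal sketch under stated assumptions: V and D are read here as valid and dirty bits, the filter is indexed by cache set, and the names `Entry`, `PTCFilter`, and `potential_targets` are hypothetical. None of these details are confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    valid: bool = False  # assumed meaning of the V field
    dirty: bool = False  # assumed meaning of the D field
    lsb: int = 0         # low-order tag bits of the remote line


class PTCFilter:
    """Per-set table of partial tags for every way of every remote core."""

    def __init__(self, num_sets: int, num_cores: int, ways: int = 4, bits: int = 8):
        self.mask = (1 << bits) - 1
        self.table = [
            [[Entry() for _ in range(ways)] for _ in range(num_cores)]
            for _ in range(num_sets)
        ]

    def potential_targets(self, set_index: int, tag: int) -> list:
        """Cores whose partial tags match; an empty list means the snoop can be skipped."""
        want = tag & self.mask
        return [
            core
            for core, ways in enumerate(self.table[set_index])
            if any(e.valid and e.lsb == want for e in ways)
        ]
```

An empty result corresponds to the filter-miss case, where the request goes straight to the upper level; a non-empty result lists the only cores that need to be snooped.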

9 PTC: Filter Miss
[Figure: CPUs with their D$ and the controller on the address bus, snoop bus, and command bus; steps 1 to 3 of a filter miss are marked.]

10 PTC: Filter Hit
[Figure: CPUs with their D$ and the controller on the address bus, snoop bus, and command bus; steps 2 and 4 of a filter hit are marked.]

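11 Filter Maintenance
[Figure: a CPU whose cache set holds tags B, F, D, E issues Request = A on the address bus and misses; A is to be placed in the position of tag F. The snoop controller keeps a Pending Request Table with columns Addr., C, W, D for Core 0 ... Core i, here holding {Address=A, C=0, W=1, D=1}. Over the command bus the PTC-Filters are told: place A, insert in Way 1 of core 0. Steps 1 to 4 are marked.]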
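The maintenance step above can be sketched in Python as follows. The slide does not define the fields of the pending request, so this sketch assumes C is the requesting core, W is the way being filled, and D is a dirty bit. The dictionary layout, the set index, and the names `apply_update` and `targets` are hypothetical and only serve the illustration.

```python
def apply_update(table: dict, set_index: int, core: int, way: int,
                 tag: int, dirty: bool, bits: int = 8) -> None:
    """Record that `core` now holds `tag` in `way`, overwriting the old partial tag."""
    table[(set_index, core, way)] = (tag & ((1 << bits) - 1), dirty)


def targets(table: dict, set_index: int, tag: int, bits: int = 8) -> list:
    """Cores whose recorded partial tags match the request in this set."""
    want = tag & ((1 << bits) - 1)
    return sorted({
        core
        for (s, core, _way), (lsb, _dirty) in table.items()
        if s == set_index and lsb == want
    })


# Way 1 of core 0 first holds a line with partial tag 0xAA (standing in for tag F).
filter_table = {(5, 0, 1): (0xAA, False)}
# The command-bus message {Address=A, C=0, W=1, D=1} replaces it with A's partial tag.
apply_update(filter_table, set_index=5, core=0, way=1, tag=0x1234, dirty=True)
```

After the update, a later request whose low 8 tag bits are 0x34 in set 5 would list core 0 as a potential target, while a request matching the evicted partial tag 0xAA would no longer do so.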

12 Methodology
Tools: SESC simulator, 4-way CMP, SPLASH-2 benchmarks, CACTI.
L2: MB, 4-banked, 16-way, 10 cycle latency.
Interconnect: 6 cycle arbitration + 2 cycle core-to-controller latency, crossbar data network, MESI protocol.
DL1/IL1: 4-way/2-way, 64KB/32KB, 3 cycle latency, 64 B cache line.
Memory: 500 cycle access.

13 Performance Average: 2.9%

14 Bandwidth Average: 78.5%

15 Tag Power Average: 52%

16 Discussion
Why do benchmarks show different performance improvements?
Different cache miss frequency.
Different early miss detection frequency.
Not all cache misses are on the critical path.
Filter overhead: Timing: 1 cycle. Power: 78.5% of a single tag array access.

17 Summary
PTC: Using a subset of the tag bits to improve bandwidth/power efficiency.
Results: Performance: 2.9%. Tag Power: 52%. Bandwidth: 78.5%.

18

19 Global vs. Local Miss
[Figure: a D$ asks "Have B?" over the interconnect, with the upper-level cache behind it. Global miss: the other D$ answer NO. Local miss: some D$ answer NO while another answers YES.]
Local miss detection: better power/bandwidth profile.
Remote miss detection (source-based approach) vs. (destination-based filter).

20 Partial tag lookup: global miss

21 Partial tag lookup: local miss