1 Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture
Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser
The Technion – Israel Institute of Technology

2 CMPs severely stress on-chip caches:
• Capacity
• Bandwidth
• Latency
Data sharing complicates our life:
• Contention on shared data
• Synchronization
Caches are a principal challenge in CMP
How to organize & handle data in CMP caches?

3 Outline
Caches in CMP
• Cache-in-the-Middle layout
Application characterization
Nahalal solution
• Overview
Results
Putting Nahalal into practice
• Line search
• Scalability
Summary

4 Tackling Cache Latency via NUCA
Due to the growing wire delay, hit time depends on physical location [Agarwal et al., ISCA 2000]
NUCA – Non-Uniform Cache Architecture [Kim et al., ASPLOS'02; Beckmann and Wood, MICRO'04]:
• Non-uniform access times: closer data ⇒ smaller hit time
• Aim for vicinity of reference: locate data lines closer to their client
Dynamic NUCA (DNUCA) [Kim et al., ASPLOS'02; Beckmann and Wood, MICRO'04]: migrate cache lines towards the processors that access them
Source: [Keckler et al., ISSCC 2003]
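To make the distance argument concrete, here is a toy hit-time model; a minimal sketch, assuming a mesh NoC and illustrative latency constants that are not figures from the talk:

    def nuca_hit_time(cpu_xy, bank_xy, bank_access_cycles=8, cycles_per_hop=2):
        """Toy NUCA hit-time model: a fixed bank access latency plus a
        wire delay that grows with the Manhattan hop distance between
        the requesting CPU and the bank holding the line.
        All constants are illustrative assumptions."""
        hops = abs(cpu_xy[0] - bank_xy[0]) + abs(cpu_xy[1] - bank_xy[1])
        return bank_access_cycles + cycles_per_hop * hops

    # A nearby bank is cheaper than a distant one:
    print(nuca_hit_time((0, 0), (0, 1)))  # 10 cycles
    print(nuca_hit_time((0, 0), (3, 3)))  # 20 cycles

DNUCA exploits exactly this gap: migrating a line toward its client trades a few migration steps for a smaller hop count on every subsequent hit.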

5 Cache-In-the-Middle Layout (CIM) [Beckmann and Wood, MICRO'04; Beckmann et al., MICRO'06]
Shared L2 cache:
• Higher capacity utilization
• Single copy ⇒ no inter-cache coherence
Banked, DNUCA:
• Interconnected using a Network-on-Chip (NoC)
[Figure: eight CPUs (CPU0–CPU7) around the chip edge, with the L2 banks (Bank0–Bank7) in the middle]

6 Remoteness of Shared Data
Shared data inevitably resides far from (some of) its clients ⇒ long access times
[Figure: CIM layout; a shared line in a middle bank is distant from the CPUs on the far side]

7 Observations on Memory Accesses
For many parallel applications (Splash-2, SpecOMP, Apache, SpecJBB, STM, ...):
1. Access to shared lines is substantial
2. Shared lines are shared by many processors
3. A small number of lines accounts for a large fraction of the total accesses
A small number of lines, shared by many processors, is accessed numerous times ⇒ the shared hot lines effect
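A sketch of how such a characterization could be computed from an access trace; the trace format (core, line-address pairs) and the top-1% cut are assumptions, not the methodology of the talk:

    from collections import defaultdict

    def characterize(trace):
        """trace: iterable of (core_id, line_address) pairs.
        Reports what share of accesses goes to lines touched by more than
        one core, and how concentrated accesses are in the hottest lines."""
        accesses = defaultdict(int)   # line -> access count
        sharers = defaultdict(set)    # line -> cores that touched it
        total = 0
        for core, line in trace:
            accesses[line] += 1
            sharers[line].add(core)
            total += 1
        shared = sum(n for line, n in accesses.items() if len(sharers[line]) > 1)
        hottest = sorted(accesses.values(), reverse=True)
        top = sum(hottest[:max(1, len(hottest) // 100)])
        print(f"accesses to shared lines: {shared / total:.1%}")
        print(f"accesses to the hottest 1% of lines: {top / total:.1%}")

Under the shared-hot-lines effect, both printed fractions come out large for the workloads listed above.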

8 Shared Data Hinders Cache Performance
What can be done better?
• Bring shared data closer to all processors
• Preserve vicinity of private data
[Figure: CIM layout, illustrating shared lines stranded in distant banks]

9 This Has Been Addressed Before
[Figure: aerial view of the Nahalal cooperative village next to an overview of the Nahalal cache organization, with processors P0–P7 arranged around a shared center]

10 Nahalal Layout
A new architectural differentiation of cache lines
• According to the way the data is used: private vs. shared
Designated area for shared data lines in the center
• Small & fast structure
• Close to all processors
Outer rings used for private data
• Preserves vicinity of private data
[Figure: idealized ring layout (P0–P7) and a more realistic layout, with CPU0–CPU7 surrounding a shared bank and private banks Bank0–Bank7 on the outside]
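The geometric intuition behind the layout can be checked with a toy distance model; the coordinates and the Euclidean metric are illustrative assumptions:

    import math

    def avg_distance(cpus, bank):
        """Mean CPU-to-bank distance for a list of CPU coordinates."""
        return sum(math.dist(c, bank) for c in cpus) / len(cpus)

    # Nahalal-style ring: eight CPUs on a unit circle, shared bank at the center.
    ring = [(math.cos(2 * math.pi * i / 8), math.sin(2 * math.pi * i / 8))
            for i in range(8)]
    print(avg_distance(ring, (0.0, 0.0)))  # 1.0 -- equally close to every CPU

    # CIM-style row: a bank at one edge is near some CPUs, far from others.
    row = [(float(i), 0.0) for i in range(8)]
    print(avg_distance(row, (0.0, 1.0)))   # ~3.8 on average, ~7.1 for the far CPU

A central shared bank both equalizes and minimizes the distance to every processor, which is exactly what shared hot lines need.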

11 Nahalal Cache Management
Where does the data go?
• First access: the line goes to the private yard of the requester
• Accesses by additional cores: the line moves to the middle
• On eviction from an over-crowded middle, a line can go to any sharer's private yard
In typical workloads, virtually all accesses to shared data are satisfied from the middle
[Figure: Nahalal layout with the shared bank in the center]
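A minimal behavioral sketch of this placement policy; bank capacities, FIFO eviction order, and the choice of which sharer's yard receives an evicted line are assumptions not specified on the slide:

    import random
    from collections import defaultdict

    class NahalalL2:
        """Behavioral model of Nahalal placement (not cycle-accurate)."""

        def __init__(self, n_cores=8, middle_capacity=4):
            self.middle_capacity = middle_capacity  # toy capacity, in lines
            self.location = {}                      # line -> "middle" or yard index
            self.sharers = defaultdict(set)         # line -> cores that touched it
            self.middle = []                        # lines currently in the middle

        def access(self, core, line):
            self.sharers[line].add(core)
            if line not in self.location:
                # First access: place the line in the requester's private yard.
                self.location[line] = core
            elif self.location[line] != "middle" and len(self.sharers[line]) > 1:
                # A second core touched the line: migrate it to the middle.
                self._move_to_middle(line)
            return self.location[line]

        def _move_to_middle(self, line):
            if len(self.middle) >= self.middle_capacity:
                # Over-crowded middle: evict the oldest line (FIFO assumed)
                # into the private yard of one of its sharers.
                victim = self.middle.pop(0)
                self.location[victim] = random.choice(sorted(self.sharers[victim]))
            self.middle.append(line)
            self.location[line] = "middle"

For example, l2.access(0, "A") places line A in CPU0's yard, and a later l2.access(3, "A") migrates it to the middle, where all subsequent sharers find it close by.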

12 Simulations
Full-system simulation via SIMICS:
• 8-processor CMP
• Private L1 for each processor (32 KByte)
• 16 MByte of shared L2
Layouts compared:
• CIM (Cache In the Middle): 2 MB near each processor
• Nahalal: 1.875 MB near each processor, 1 MB in the middle
[Figure: the simulated Nahalal and CIM layouts side by side]

13 Cache Performance
26.8% improvement in average cache hit time
• 41.1% in apache
[Figure: average cache hit time in clock cycles per benchmark; per-benchmark improvements range from 3.9% to 41.1%]

14 Average Distance – Shared vs. Private
Nahalal shortens the distance to shared data
Distance to private data remains roughly the same
[Figure: average relative distance to shared vs. private data]

15 Putting Nahalal into Practice
Line search: how to find a line within the cache
Line migration: when and where to move a line between places in the cache
Scalability: how far can we take the Nahalal structure
"The difference between theory and practice is always larger in practice than it is in theory" [Peter H. Salus]
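The slides do not spell out the search algorithm, so the following two-phase lookup is purely a plausible sketch: probe the two locations the placement policy makes most likely (the shared middle and the requester's own yard) before falling back to the remaining banks.

    def find_line(line, requester, yards, middle):
        """Hypothetical two-phase line search; an assumption, not the
        paper's algorithm. `yards` is a list of per-core bank contents,
        `middle` is the shared bank; each is modeled as a set of lines."""
        # Phase 1: probe the likely locations (in hardware, in parallel).
        if line in middle:
            return "middle"
        if line in yards[requester]:
            return requester
        # Phase 2: broadcast to the remaining private yards.
        for i, yard in enumerate(yards):
            if i != requester and line in yard:
                return i
        return None  # miss: fetch from the next memory level

Because virtually all shared-data accesses hit the middle and private data sits in the requester's own yard, phase 1 would resolve the common case without a broadcast.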

16 Summary
State-of-the-art caches' weakness:
• Remoteness of shared data
Software behavior:
• Shared-hot-lines effect
• Shared data hinders cache performance
Nahalal cache architecture:
• Places shared lines closer to all processors
• Preserves vicinity of private data
A new architectural differentiation of cache lines:
• Not all data should be treated equally
• Data-usage-aware design
Questions?

17 Backup

18 Scalability Issues
This has (also) been addressed before
[Figure: a cluster of Garden-Cities (Ebenezer Howard, 1902), the villages Nahalal and Kfar Yehoshua, and a clustered Nahalal CMP design]

