CSCE 432/832 High Performance: An Introduction to the Multicore Memory Hierarchy (Dongyuan Zhan)

What We Learnt from the Video
The motivation for multi-core processors:
- Better utilization of on-chip transistor resources as technology scales
- Use thread-level parallelism to increase throughput
Two models of multi-core processors:
- Homogeneous vs. heterogeneous CMPs
Communication & synchronization among cores:
- Cores communicate with each other via the shared cache/memory
- Reads/writes are synchronized via locks, mutexes, or transactional memory
How to program multi-core processors:
- Use OpenMP to write parallel programs (see the sketch below)
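As a minimal illustration of the OpenMP model mentioned above (a sketch, not from the original slides; compile with gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        /* Distribute loop iterations across cores; each thread keeps a
           private partial sum, combined by the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000000; i++)
            sum += 1.0 / (i + 1);
        printf("harmonic sum = %f (up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }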

From Teraflop Multiprocessor to Teraflop Multicore: ASCI Red (1997~2005) [photo slide]

Intel Teraflop Multicore Prototype [photo slide]

From Teraflop Multiprocessor to Teraflop Multicore
Pictured here is ASCI Red, the first computer to reach a teraflops of processing, i.e., one trillion floating-point calculations per second. It:
- used about 10,000 Pentium processors running at 200 MHz;
- consumed 500 kW of power for computation and another 500 kW for cooling;
- occupied a very large room.
Just over 10 years later, Intel announced the world's first processor that delivers the same teraflops performance on a single 80-core chip running at 5 GHz:
- consuming only 62 watts of power;
- small enough to rest on the tip of your finger.

A Commodity Many-core Processor: Tile64 Multicore Processor (2007~now) [photo slide]

The Schematic Design of Tile64
[Block diagram slide. Its labels: DDR2 memory controllers 0-3; PCIe 0/1 and XAUI 0/1 MAC/PHY with SerDes; GbE 0/1 flexible I/O; UART, HPI, JTAG, I2C, SPI; per-tile processor (P0-P2, register file), L1I/L1D caches with ITLB/DTLB, L2 cache, 2D DMA, and a switch on the STN/MDN/TDN/UDN/IDN networks.]
4 essential components:
- Processor cores
- On-chip caches
- Network-on-Chip (NoC)
- I/O controllers

Agenda Today: An Introduction to the Multi-core Memory Hierarchy
Why do we need a memory hierarchy for any processor?
- It is a tradeoff between capacity and latency
- Make common cases fast by exploiting programs' locality (a general principle in computer architecture)
What is the difference between the memory hierarchies of single-core and multi-core CPUs?
- They are quite distinct in their on-chip caches
- Managing the CMP caches is of paramount importance to performance
- The capacity and latency issues remain for CMP caches
- How to keep CMP caches coherent
- Hardware & software management schemes

The Motivation for Mem Hierarchy
Trading off between capacity and latency (upper levels are smaller and faster; lower levels are larger and cheaper per byte):

Level         Capacity         Access time       Cost        Managed by         Transfer unit
Registers     100s of bytes    0.3-0.5 ns        --          program/compiler   4-8 byte operands
L1/L2 cache   10s-100s of KB   ~1 ns - ~10 ns    --          cache controller   32/64-byte (L1) or 64/128-byte (L2) blocks; on chip
Main memory   GBytes           200 ns - 300 ns   ~$15/GB     OS                 4 KB - 64 KB pages; off chip
Disk          1s-10s of TB     ~10 ms            ~$0.15/GB   --                 --

Programs' Locality
Two kinds of basic locality:
- Temporal: if a memory location is referenced, it is likely that the same location will be referenced again in the near future. In the slide's example, i and j are referenced on every iteration:
    int i;
    register int j;
    for (i = 0; i < 20000; i++)
        for (j = 0; j < 300; j++)
            ;
- Spatial: if a memory location is referenced, it is likely that nearby memory locations will be referenced in the near future (see the sketch below).
Locality + smaller, faster hardware for the common case = memory hierarchy
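To make the spatial-locality case concrete (an illustrative sketch, not from the original slides), the loop order below decides whether consecutive accesses touch neighboring addresses:

    #define N 1024
    double a[N][N];

    /* Good spatial locality: C stores rows contiguously, so this walk
       touches consecutive addresses and uses every word of a cache
       block before moving on. */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Poor spatial locality: a stride of N*sizeof(double) touches a
       different cache block on (almost) every access. */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }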

The Challenges of the Memory Wall
The truths:
- In many applications, 30-40% of all instructions are memory operations
- CPU speed scales much faster than DRAM speed
- In 1980, CPUs and DRAMs operated at almost the same speed, about 4 MHz ~ 8 MHz
- CPU clock frequency has doubled every 2 years, while DRAM speed has only doubled about every 6 years

Memory Wall
DRAM bandwidth is quite limited: two DDR2-800 modules reach a combined bandwidth of 12.8 GB/s (about 6.4 B per CPU cycle if the CPU runs at 2 GHz). So in a multicore processor, when multiple 64-bit cores need to access memory at the same time, they exacerbate contention for DRAM bandwidth.
Memory Wall: the CPU has to spend a lot of time on off-chip memory accesses. E.g., the Intel XScale spends on average 35% of its total execution time on memory accesses. The high latency and low bandwidth of the DRAM system become a bottleneck for CPUs.
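Checking the slide's arithmetic (each DDR2-800 module peaks at 6.4 GB/s):

\[
2 \times 6.4\ \mathrm{GB/s} = 12.8\ \mathrm{GB/s},
\qquad
\frac{12.8 \times 10^{9}\ \mathrm{B/s}}{2 \times 10^{9}\ \mathrm{cycles/s}} = 6.4\ \mathrm{B/cycle}
\]

Shared among, say, 8 cores, that is only 0.8 B/cycle per core, far below one 64-bit (8-byte) word per core per cycle.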

Solutions
How to alleviate the memory wall problem:
- Hide the memory access latency: prefetching (see the sketch below)
- Reduce the latency: move memory closer to the CPU, e.g., 3D-stacked on-chip DRAM
- Increase the bandwidth: optical I/O
- Reduce the number of memory accesses: keep as much reusable data in the cache as possible
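A small illustration of software prefetching, the first latency-hiding technique above (a sketch using GCC's __builtin_prefetch; the prefetch distance of 16 elements is an assumption to be tuned per machine):

    /* Walk a large array, requesting each block a fixed distance ahead
       so the DRAM access overlaps with the computation on earlier
       elements. */
    long sum_with_prefetch(const long *a, long n) {
        long s = 0;
        for (long i = 0; i < n; i++) {
            /* rw = 0 (read), locality = 1 (low temporal reuse). On most
               ISAs prefetch instructions do not fault, so running a few
               elements past the end is harmless. */
            __builtin_prefetch(&a[i + 16], 0, 1);
            s += a[i];
        }
        return s;
    }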

CMP Cache Organizations (Shared L2 Cache) [figure slide]

CMP Cache Organizations (Private L2 Cache) [figure slide]

How to Address Blocks in a CMP
How to address blocks in a single-core processor:
- L1 caches are typically virtually indexed but physically tagged, while L2 caches are mostly physically indexed and tagged (related to virtual memory).
How to address blocks in a CMP:
- L1 caches are accessed in the same way as in a single-core processor.
- If the L2 caches are private, the addressing of a block is still the same.
- If the L2 cache is shared among all of the cores, then a subset of the physical address bits first selects the block's home tile (bank), and the remaining bits serve as set index and tag within that bank (see the sketch below).
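A sketch of the address split for a shared, block-interleaved L2 (the field widths are illustrative assumptions: 64-byte blocks, 16 tiles, 1024 sets per bank):

    #include <stdint.h>

    #define BLOCK_BITS 6   /* 64-byte blocks                 */
    #define TILE_BITS  4   /* 16 tiles, one L2 bank per tile */
    #define SET_BITS   10  /* 1024 sets per bank             */

    /* Home tile: low-order block-address bits, so consecutive blocks
       are interleaved across tiles. */
    static inline unsigned home_tile(uint64_t paddr) {
        return (paddr >> BLOCK_BITS) & ((1u << TILE_BITS) - 1);
    }

    /* Set index within the home tile's bank; the bits above it form
       the tag. */
    static inline unsigned set_index(uint64_t paddr) {
        return (paddr >> (BLOCK_BITS + TILE_BITS)) & ((1u << SET_BITS) - 1);
    }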

[Two figure slides illustrating how blocks are addressed in shared and private L2 caches.]

CMP Cache Coherence
Snoop-based:
- All caches on the bus snoop the bus to determine whether they hold a copy of the block being requested.
- Multiple copies of a data block can be read without any coherence problems; however, a processor must gain exclusive access to the bus (and either invalidate or update the other copies) in order to write.
- Sufficient for small-scale CMPs with a bus interconnect.
Directory-based:
- Shared data is tracked in a common directory that maintains coherence between caches. When a cache line is changed, the directory either updates or invalidates the other caches holding that line.
- Necessary for many-core CMPs with interconnects such as a mesh (a sketch of a directory entry follows).
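A minimal sketch of a full-map directory entry (the 16-sharer bit vector and the three states are illustrative assumptions; send_invalidate stands in for a hypothetical NoC message):

    #include <stdint.h>

    void send_invalidate(unsigned tile);  /* hypothetical NoC helper */

    enum dir_state { UNCACHED, SHARED, MODIFIED };

    /* One directory entry per memory block: which tiles hold a copy,
       and in what state. */
    struct dir_entry {
        enum dir_state state;
        uint16_t sharers;  /* bit i set => tile i has a copy */
    };

    /* On a write request, invalidate every other sharer before
       granting the requester exclusive access. */
    void handle_write(struct dir_entry *e, unsigned requester) {
        for (unsigned t = 0; t < 16; t++)
            if ((e->sharers & (1u << t)) && t != requester)
                send_invalidate(t);
        e->sharers = 1u << requester;
        e->state   = MODIFIED;
    }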

Non-Uniform Cache Access Time in Shared L2 Caches [figure slide]

Non-Uniform Cache Access Time in Shared L2 Caches
Let's assume that Core0 needs to access a data block stored in Tile15:
- Assume accessing an L2 cache bank takes 10 cycles;
- Assume transferring a data block from one router to an adjacent one takes 2 cycles;
- Then a remote access to the block in Tile15 takes 10 + 2*(2*6) = 34 cycles, much longer than a local L2 access.
Non-Uniform Cache Access (NUCA) means that the latency of a cache access is a function of the physical locations of both the requesting core and the cache bank (see the sketch below).
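The slide's arithmetic generalizes to any pair of tiles on the mesh (a sketch; the 4x4 mesh, 10-cycle bank access, and 2-cycle per-hop latency are the slide's example numbers):

    #include <stdlib.h>

    #define MESH_W      4   /* 4x4 mesh: tiles 0..15 */
    #define BANK_CYCLES 10
    #define HOP_CYCLES  2

    /* Round-trip latency for a core on tile `src` to access a block
       whose home bank is on tile `dst`, with Manhattan-distance
       routing (request travels out, data travels back). */
    int nuca_latency(int src, int dst) {
        int hops = abs(src % MESH_W - dst % MESH_W)
                 + abs(src / MESH_W - dst / MESH_W);
        return BANK_CYCLES + HOP_CYCLES * 2 * hops;
    }
    /* nuca_latency(0, 15) = 10 + 2*(2*6) = 34, matching the slide. */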

How to Reduce the Latency of Remote Cache Accesses
At least two solutions:
- Place the data close enough to the requesting core:
  - Victim replication [1]: place L1 victim blocks in the local L2 cache;
  - Change the layout of the data: one approach is discussed shortly.
- Use faster transmission:
  - Use special on-chip interconnects that transmit data via radio-wave or light-wave signals.

The RF-Interconnect [2] [figure slide]

Interference in Caching in Shared L2 Caches
The problem: because the shared L2 cache is accessible to all cores, one core can interfere with another when placing blocks in it.
For example, in a dual-core CMP, if a streaming application such as a video player is co-scheduled with a scientific computation application that has good locality, the aggressive streaming application will continuously place new blocks in the L2 cache, evicting the computation application's cached blocks and hurting its performance.
Solution: regulate each core's usage of the L2 cache based on the utility it derives from the cache [3].

The Capacity Problems in Private L2 Caches
The problems:
- The L2 capacity accessible to each core is fixed, regardless of the core's real cache capacity demand. E.g., if two applications are co-scheduled on a dual-core CMP with two 1 MB private L2 caches, and one application demands 0.5 MB of cache while the other asks for 1.5 MB, then one private L2 cache is underutilized while the other is overwhelmed.
- If a parallel program is running on the CMP, different cores will have a lot of data in common. However, the private L2 organization requires each core to maintain a copy of the common data in its local cache, leading to a lot of data redundancy and degrading the effective on-chip capacity.
A solution: cooperative caching [4].

A Comparison Between Shared (L2S) and Private (L2P) L2 Caches
- Set mapping: L2S first locates the tile and then indexes the set; L2P works the same as in a single-core CPU.
- Coherence directory: in L2S, each L2 entry has its own directory bits, with no separate directory caches; L2P needs independent shared directory caches employing the same mapping scheme as in the figure.
- Capacity: L2S offers high aggregate capacity to any core; L2P gives each core a relatively low fixed capacity.
- Latency: in L2S, due to the distributed mapping, much of the requested on-chip data resides in non-local L2 banks; in L2P, requested on-chip data is in the private (closest) L2.
- Sharing of capacity & data: L2S yes; L2P none.
- Performance isolation: L2S suffers severe contention among cores in L2 capacity allocation; L2P has no interference among cores.
- Commodity CMPs: L2S: Intel Core 2 Duo E6600, Sun SPARC Niagara 2, Tilera Tile64 (64 cores); L2P: AMD Athlon64 6400+, Intel Pentium D 840.

Using the OS to Manage CMP Caches [5]
Two kinds of address space: virtual (or logical) & physical.
Page coloring: there is a fixed correspondence between a physical page and its location in the cache.
In CMPs with a shared L2 cache, by changing the mapping scheme, the OS can determine where a virtual page requested by a core is located in the L2 cache:
    Tile# (where a page is cached) = physical page number % #Tiles
A sketch of this mapping follows.
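A sketch of the slide's mapping inside an OS page allocator (illustrative; alloc_phys_page_with_color is a hypothetical helper standing in for a color-aware free list):

    #include <stdint.h>

    #define NUM_TILES 16

    uint64_t alloc_phys_page_with_color(unsigned color); /* hypothetical */

    /* Home tile of a physical page under the slide's scheme:
       Tile# = physical page number % #Tiles. */
    static inline unsigned page_home_tile(uint64_t ppn) {
        return ppn % NUM_TILES;
    }

    /* To cache a virtual page near `core`, pick a free physical page
       whose page number is congruent to that core's tile number
       modulo NUM_TILES. */
    uint64_t alloc_page_near(unsigned core) {
        return alloc_phys_page_with_color(core % NUM_TILES);
    }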

Using the OS to Manage CMP Caches [figure slide]

Using the OS to Manage CMP Caches
The benefits:
- Improved data proximity
- Capacity sharing
- Data sharing (to be introduced next time)

Summary
What we have covered in this class:
- The memory wall problem for CMPs
- The two basic cache organizations for CMPs
- HW & SW approaches to managing the last-level cache

References
[1] M. Zhang, et al. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. ISCA'05.
[2] F. Chang, et al. CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect. HPCA'08.
[3] A. Jaleel, et al. Adaptive Insertion Policies for Managing Shared Caches. PACT'08.
[4] J. Chang, et al. Cooperative Caching for Chip Multiprocessors. ISCA'06.
[5] S. Cho, et al. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. MICRO'06.