DDM - A Cache-Only Memory Architecture
Erik Hagersten, Anders Landin and Seif Haridi
Presented by Narayanan Sundaram, 03/31/2008
CS258 - Parallel Computer Architecture

Shared Memory MP - Taxonomy
Shared memory multiprocessors:
– Single memory (usually UMA)
– Distributed memory (usually NUMA)
– Cache-only (COMA)

Uniform Memory Access (UMA)
– All processors take the same time to reach memory
– The network can be a bus, a fat tree, etc.
– There can be one or more memory units
– Cache coherence is usually maintained through snoopy protocols on bus-based architectures

Non-Uniform Memory Access (NUMA)
– The network can be anything, e.g. a butterfly, mesh, or torus
– Scales well, up to thousands of processors
– Cache coherence is usually maintained through directory-based protocols
– Partitioning of data is static and explicit

Cache-Only Memory Architecture (COMA)
– Data partitioning is dynamic and implicit
– The attraction memory acts as a large cache for the processor
– The attraction memory can hold data that its processor will never access (think of a distributed file system)
– Unique selling point: can give UMA-like performance on NUMA-scale architectures

COMA Addressing Issues
– Item: similar to a cache line, the item is the coherence unit that is moved around
– Memory references: virtual address -> item identifier; the item identifier space is logically the same as the physical address space, but there is no permanent mapping to physical locations
– Item migration improves efficiency: the programmer only has to make sure locality holds, and data partitioning can be dynamic
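To make the item abstraction concrete, here is a minimal sketch (the item size, set count, and class/field names are assumptions for illustration, not the DDM hardware) of an attraction memory as a set-associative store indexed by item identifier, with no permanent home for any item:

    # Minimal sketch of COMA item addressing (assumed parameters, not the prototype's).
    ITEM_SIZE = 16      # bytes per item (the coherence unit)
    NUM_SETS = 1024
    WAYS = 4            # nominal associativity of the attraction memory

    def item_id(address):
        """Drop the offset bits: the item identifier names an item globally
        but implies no permanent physical location for it."""
        return address // ITEM_SIZE

    class AttractionMemory:
        def __init__(self):
            # Each set maps item identifier -> (state, data); a real attraction
            # memory would cap each set at WAYS entries and replace on overflow.
            self.sets = [dict() for _ in range(NUM_SETS)]

        def lookup(self, ident):
            """Return the local copy of an item, or None if it has to be
            fetched over the DDM bus (and may then migrate here)."""
            return self.sets[ident % NUM_SETS].get(ident)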

Data Diffusion Machine (DDM)
– DDM is a hierarchical structure implementing COMA
– Uses the DDM bus
– The attraction memory communicates with the processor using the below protocol and with the DDM bus using the above (snoopy) protocol
– At the topmost level, the node uses the top protocol

Architecture of a single-bus DDM (figure)

Single-bus DDM protocol
An item can be in one of seven states:
– Invalid
– Exclusive
– Shared
– Reading
– Waiting
– Reading-and-Waiting
– Answering
The bus carries the following transactions:
– Erase
– Exclusive
– Read
– Data
– Inject
– Out
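For later reference, the state and transaction names can be written down directly; this is a sketch only (the enum encodings and one-line glosses are mine, based on the paper's descriptions):

    from enum import Enum, auto

    class ItemState(Enum):
        """The seven states an item can have in an attraction memory."""
        INVALID = auto()
        EXCLUSIVE = auto()             # the only copy in the system
        SHARED = auto()                # other copies may exist
        READING = auto()               # transient: waiting for data after a read
        WAITING = auto()               # transient: waiting to become exclusive
        READING_AND_WAITING = auto()   # transient: waiting for data, then exclusivity
        ANSWERING = auto()             # transient: promised to answer a read

    class BusTransaction(Enum):
        """Transactions carried on the DDM bus."""
        ERASE = auto()       # erase all other copies of the item
        EXCLUSIVE = auto()   # acknowledge an erase request
        READ = auto()        # request a copy of the item
        DATA = auto()        # carry the item data in reply to a read
        INJECT = auto()      # carry the only copy, looking for a new home
        OUT = auto()         # carry a replaced copy on its way out of the node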

Single-bus DDM protocol: processor-side actions
– Read: if the item is Exclusive or Shared, read locally with no bus transaction; otherwise issue a bus Read
– Write: if Exclusive, write locally with no bus transaction; if Shared, issue an Erase and then write; otherwise issue a bus Read and go to Reading-and-Waiting
– Replace: if Shared, issue Out; otherwise issue Inject
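The decision table above translates almost line for line into pseudocode (a sketch assuming the ItemState and BusTransaction enums from the previous slide; the transient-state handling of the real protocol is omitted):

    def on_read(state):
        if state in (ItemState.EXCLUSIVE, ItemState.SHARED):
            return None                  # read hit, no bus transaction
        return BusTransaction.READ       # read miss: fetch the item over the bus

    def on_write(state):
        if state is ItemState.EXCLUSIVE:
            return None                  # write hit, no bus transaction
        if state is ItemState.SHARED:
            return BusTransaction.ERASE  # invalidate the other copies, then write
        # Otherwise fetch the item first; the state becomes Reading-and-Waiting.
        return BusTransaction.READ

    def on_replace(state):
        if state is ItemState.SHARED:
            return BusTransaction.OUT    # other copies may still exist elsewhere
        return BusTransaction.INJECT     # possibly the last copy: it must find a new home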

Attraction Memory Protocol (without replacement) (figure)

Hierarchical DDM protocol
– Directories are similar to attraction memories, except that they do not store any data (state only)
– Toward the bus below, a directory behaves like the top protocol; toward the bus above, like the above protocol
– Multilevel read
– Multilevel write
– Multilevel replacement
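One way to picture a directory's role in a multilevel read is the following sketch (the class and method names are invented; a real directory is a set-associative state memory, and the protocol also covers writes and replacement):

    class DirectorySketch:
        """State-only node: it records which items exist somewhere in the
        subtree below it, but never holds the data itself."""
        def __init__(self):
            self.items_below = set()    # identifiers of items present below

        def read_from_above(self, ident):
            # A read seen on the bus above concerns this subtree only if
            # some attraction memory below holds a copy.
            return "pass down" if ident in self.items_below else "ignore"

        def read_from_below(self, ident):
            # A read miss from below leaves the subtree only when no other
            # node inside the subtree can supply the item.
            return "keep inside the subtree" if ident in self.items_below else "pass up"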

Multilevel DDM protocol
– Directory requirement (where B_i is the branching factor at level i):
  Size: Size(Dir_{i+1}) = B_i * Size(Dir_i)
  Associativity: Assoc(Dir_{i+1}) = B_i * Assoc(Dir_i)
– Too much hierarchy would be costly and slow; "imperfect directories" could be used instead
– The protocol is sequentially consistent
– Bandwidth requirements can be met with a fat-tree network, directory and bus splitting, or heterogeneous networks
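To make the sizing rule concrete, a small sketch (the attraction-memory size, associativity, and branching factors below are assumed for illustration):

    def directory_requirements(am_items, am_assoc, branching):
        """Size and associativity each directory level needs so that it can
        track every item stored anywhere below it:
        Size(Dir_{i+1}) = B_i * Size(Dir_i), Assoc(Dir_{i+1}) = B_i * Assoc(Dir_i)."""
        size, assoc = am_items, am_assoc
        levels = []
        for b in branching:     # branching factor at each level, bottom up
            size *= b
            assoc *= b
            levels.append((size, assoc))
        return levels

    # Example: 4-way, 64K-item attraction memories under two levels of 8-way buses.
    print(directory_requirements(64 * 1024, 4, [8, 8]))
    # -> [(524288, 32), (4194304, 256)]: directory size and associativity grow with
    #    every level, which is why deep hierarchies get costly and "imperfect
    #    directories" become attractive.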

COMA Prototype (figure)

Prototype description
– For address translation, DDM uses the processor's normal virtual-to-physical address translation mechanism
– For an item size of 16 bytes, the memory overhead is 6% for a 32-processor system and 16% for a 256-processor system
– For larger item sizes the overhead is lower, but false sharing may cause problems
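A rough reading of those figures (an illustration, not a calculation from the paper): 6% of a 128-bit (16-byte) item is about 8 bits, and 16% is about 20 bits, of tag, state, and directory information per item, which is why the overhead grows with machine size (more directory levels to pay for) and shrinks as the item gets larger.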

Performance (figure)

Conclusion
– COMA is a middle ground between UMA and NUMA
– In the prototype, the overhead is 16% in access time and 6-16% in memory
– Programmer productivity is improved by not having to worry about NUMA data-placement issues