Removing the Overhead from Software-Based Shared Memory
Uppsala University, Information Technology, Department of Computer Systems, Uppsala Architecture Research Team [UART]

Presentation transcript:

Removing the Overhead from Software-Based Shared Memory
Zoran Radovic and Erik Hagersten
Uppsala University, Information Technology, Department of Computer Systems, Uppsala Architecture Research Team [UART]

Problems with Traditional SW-DSMs
- Page-sized coherence unit: false sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, ...]
- Protocol agent messaging is slow: most efficiency is lost in interrupt/poll handling
[Figure: two SMP nodes, each with CPUs, memory, and a protocol agent; a load (LD x) to remote data must go through the protocol agents]

Our Proposal: DSZOOM
- Run the entire protocol in the requesting processor: no protocol agent communication!
- Assumes user-level remote memory access: put, get, and atomics [e.g., InfiniBand]
- Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]
[Figure: two SMP nodes; the requesting node's protocol code performs a LD x by accessing the remote directory (DIR) and data directly with atomic, get/put operations]

Outline
- Motivation
- General DSZOOM Overview
- DSZOOM-WF Implementation Details
- Experimentation Environment
- Performance Results
- Conclusions

DSZOOM Cluster
- DSZOOM nodes:
  - Each node is an unmodified SMP multiprocessor
  - SMP hardware keeps coherence among the caches and the memory within each node
- DSZOOM cluster network:
  - Non-coherent cluster interconnect
  - Inexpensive user-level remote memory access
  - Remote atomic operations [e.g., InfiniBand]

Squeezing Protocols into Binaries ...
- Static binary instrumentation with EEL, a machine-independent Executable Editing Library implemented in C++:
  - Instrument global LOADs with snippets containing fine-grain access-control checks
  - Instrument global STOREs with MTAG snippets
  - Insert calls to the coherence protocols, which are implemented in C

Fine-grain Access Control Checks
- The "magic" value is a small integer corresponding to an IEEE floating-point NaN [e.g., Blizzard-S, Sirocco-S]
- Floating-point load example:

      ld     [address],%reg    // original LOAD
      fcmps  %fcc0,%reg,%reg   // compare reg with itself
      fbe,pt %fcc0,hit         // if (reg == reg) goto hit
      nop
      // call the global coherence load routine (C code)
    hit:
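To make the mechanism concrete, here is a minimal C sketch of the same check (not the authors' code; checked_load and global_coherence_load are illustrative names). A globally invalid word holds the magic NaN bit pattern, and since NaN is the only IEEE value that compares unequal to itself, an ordinary self-comparison after the load detects it:

    /* Hypothetical slow-path routine: runs the global coherence load
       protocol and returns the now-valid value. */
    extern double global_coherence_load(double *addr);

    /* A word is "globally invalid" iff it holds the magic NaN pattern;
       NaN is the only IEEE value for which v != v. */
    static int is_invalid(double v) {
        return v != v;
    }

    double checked_load(double *addr) {
        double v = *addr;                     /* original load */
        if (is_invalid(v))                    /* fine-grain access check */
            v = global_coherence_load(addr);  /* coherence slow path */
        return v;
    }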

Blocking Directory Protocols
- Originally proposed to simplify the design and verification of HW-DSMs
- Eliminate race conditions
- DSZOOM implements a distributed version of a blocking protocol, with one DIR_ENTRY per cache line
[Figure: distributed directory in Node 0's G_MEM; a DIR_ENTRY (lock + presence bits) shown before and after a MEM_STORE]
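As a rough illustration (field widths and layout are assumptions, not taken from the paper), a blocking directory entry simply pairs a lock bit with the presence bits, so at most one protocol action can operate on a given cache line at a time:

    #include <stdint.h>

    /* Illustrative layout: one directory entry per cache line. The lock
       bit is acquired with a remote atomic (e.g., fetch-and-set) before
       any protocol action touches the line; holding it serializes all
       actions on that line, which is what makes the protocol "blocking"
       and eliminates protocol races. */
    typedef struct {
        uint8_t lock     : 1;  /* set while a protocol action is in flight */
        uint8_t presence : 7;  /* one bit per node caching the line */
    } dir_entry_t;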

Global Coherency Action: read data from the home node (2-hop read)
- 1a. f&s: fetch-and-set atomically locks the home node's directory entry (small packet, ~10 bytes; on the critical path)
- 1b. get: the requestor fetches the data from home memory (large packet, ~68 bytes; on the critical path)
- 2. put: the updated directory entry is written back, releasing the lock (off the critical path)

Global Coherency Action: read data modified in a third node (3-hop read)
- 1. f&s: lock the directory entry at the home node
- 2a. f&s: lock the MTAG entry at the node holding the modified copy
- 2b. get: fetch the modified data from that node
- 3a./3b. put: write back the MTAG and directory entries, releasing the locks (off the critical path)
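The 2-hop read can be sketched from the requestor's side in C. This is only a sketch under assumptions: rdma_fetch_and_set, rdma_get, rdma_put, and my_presence_bit are hypothetical stand-ins for the interconnect's user-level remote-memory primitives and the protocol's directory encoding:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical user-level remote-memory primitives. */
    extern uint64_t rdma_fetch_and_set(int node, uint64_t *remote);
    extern void rdma_get(int node, const void *remote, void *local, size_t n);
    extern void rdma_put(int node, void *remote, const void *local, size_t n);
    extern uint64_t my_presence_bit(void);  /* this node's presence bit */

    enum { LINE_SIZE = 64 };                /* assumed coherence-unit size */

    /* 2-hop read: the requesting processor runs the entire protocol
       itself; no protocol agent on the home node is involved. */
    void two_hop_read(int home, uint64_t *dir_entry,
                      void *home_line, void *local_line) {
        /* 1a. f&s: atomically lock the home directory entry
           (small packet, on the critical path) */
        uint64_t dir = rdma_fetch_and_set(home, dir_entry);

        /* 1b. get: fetch the cache line from home memory
           (large packet, on the critical path) */
        rdma_get(home, home_line, local_line, LINE_SIZE);

        /* 2. put: write the entry back with our presence bit added,
           which also releases the lock (off the critical path) */
        dir |= my_presence_bit();
        rdma_put(home, dir_entry, &dir, sizeof dir);
    }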

Compilation Process
[Figure: the unmodified SPLASH-2 application is expanded with m4 using the DSZOOM-WF implementation of the PARMACS macros, compiled with GNU gcc together with the coherence protocols (C code) and the DSZOOM-WF run-time library into an a.out executable, which EEL then rewrites into the instrumented (un)executable]

Instrumentation Performance

    Program         Problem Size
    FFT             1,048,576 points (48.1 MB)
    LU-Cont         1024 × 1024, block 16 (8.0 MB)
    LU-Non-Cont     1024 × 1024, block 16 (8.0 MB)
    Radix           4,194,304 items (36.5 MB)
    Barnes-Hut      16,384 bodies (32.8 MB)
    FMM             32,768 particles (8.1 MB)
    Ocean-Cont      514 × 514 (57.5 MB)
    Ocean-Non-Cont  258 × 258 (22.9 MB)
    Radiosity       Room (29.4 MB)
    Raytrace        Car (32.2 MB)
    Water-nsq       2,197 mols., 2 steps (2.0 MB)
    Water-sp        2,197 mols., 2 steps (1.5 MB)

[Table also reported % LD, % ST, and instrumentation overhead per program, plus averages; those values are not recoverable]

Instrumentation Breakdown
[Figure: breakdown of instrumentation overhead, sequential execution]

Current DSZOOM Hardware
- Two Sun E6000 servers connected through a hardware-coherent interface (Sun WildFire) with a raw bandwidth of 800 MB/s in each direction
- Data migration and coherent memory replication (CMR) are kept inactive
- 16 UltraSPARC II (250 MHz) CPUs and 8 GB of memory per node
- Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
- Run as a 16-way SMP, 2 × 8 CC-NUMA, and 2 × 8 SW-DSM

Process and Memory Distribution
[Figure: each process's address space holds Stack, Text & Data, Heap, PRIVATE_DATA, and the global memory G_MEM. G_MEM is built from shared segments allocated with shmget in each cabinet's physical memory (Cabinet_1_G_MEM, shmid = A; Cabinet_2_G_MEM, shmid = B) and attached with shmat at the same virtual address in every process ("aliasing"). Processes are created with fork and bound to their cabinet with pset_bind.]
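This setup can be sketched with standard System V shared-memory calls: attach the global region before forking so every process inherits it at the same virtual address. A minimal sketch with illustrative sizes and process counts; the real system allocates one segment per cabinet and binds each process to its cabinet's processor set with the Solaris pset_bind(2) call, indicated here only as a comment:

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <unistd.h>

    #define NPROCS     4                     /* illustrative */
    #define G_MEM_SIZE (64UL * 1024 * 1024)  /* illustrative */

    int main(void) {
        /* One System V segment stands in for a cabinet's G_MEM slice. */
        int shmid = shmget(IPC_PRIVATE, G_MEM_SIZE, IPC_CREAT | 0600);
        if (shmid < 0) { perror("shmget"); return 1; }

        /* Attach before fork: children inherit the mapping, so G_MEM
           appears at the same address in every process ("aliasing"). */
        char *g_mem = shmat(shmid, NULL, 0);
        if (g_mem == (char *)-1) { perror("shmat"); return 1; }

        for (int i = 1; i < NPROCS; i++) {
            if (fork() == 0) {
                /* child: here it would bind itself to its cabinet's
                   processor set with pset_bind(2) on Solaris */
                break;
            }
        }

        g_mem[0] = 1;  /* every process now shares this byte */
        return 0;
    }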

Results (1): Execution Times in Seconds (16 CPUs)
[Chart: per-application execution times comparing hardware (16-way SMP, 2 × 8 CC-NUMA) and software (2 × 8 SW-DSM, EEL-instrumented) configurations]

Results (2): Normalized Execution Time Breakdowns (16 CPUs)
[Chart: normalized execution-time breakdowns for the 2 × 8 SW-DSM (EEL-instrumented) configuration]

Conclusions
- DSZOOM completely eliminates asynchronous messaging between protocol agents
- Consistently competitive and stable performance in spite of high instrumentation overhead
  - ~30% slowdown compared to hardware
- State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 3–59%

DSZOOM’s Home Page