
Shared memory and message passing revisited in the many-core era
Aram Santogidis, CERN
Inverted CERN School of Computing (iCSC2016), 29 February – 2 March 2016

The pioneers of concurrent programming
- Edsger Dijkstra: mutual exclusion, Cooperating Sequential Processes, semaphores
- Per Brinch Hansen: Concurrent Pascal, shared classes, the Solo OS, Distributed Processes
- C.A.R. Hoare: Communicating Sequential Processes (CSP), monitors

Communication is important
[Diagram: three processes progressing over time, periodically communicating and synchronizing with each other. Shared memory vs message passing.]

Agenda of the talk
- Concurrency and communication
- Two basic examples of the two models
- Conventional wisdom for the two models
- Cache coherence and manycore processors
- Emerging paradigm shift in OS architectures
- The future perspective

The Shared Memory model
[Diagram: several threads attached to one shared memory holding shared data structures.]
- Threads communicate implicitly with each other via shared data structures
- Synchronization primitives (locks, semaphores, etc.)
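A minimal sketch of this model, written here in C++11 with std::thread (not code from the talk): two threads communicate implicitly by appending to a shared std::vector, and a mutex is the explicit synchronization primitive.

// Shared-memory sketch: implicit communication through a shared container,
// explicit synchronization with a mutex.
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> shared_data;   // shared data structure
    std::mutex m;                   // explicit synchronization primitive

    auto worker = [&](int id) {
        for (int i = 0; i < 1000; ++i) {
            std::lock_guard<std::mutex> lock(m);  // protect the critical section
            shared_data.push_back(id);
        }
    };

    std::thread t1(worker, 1), t2(worker, 2);
    t1.join();
    t2.join();

    std::cout << "elements: " << shared_data.size() << '\n';  // 2000
}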

The message passing model
[Diagram: threads exchanging messages directly with each other.]
- Threads communicate explicitly with each other by exchanging messages
- The more fundamental of the two models
- Synchronous or asynchronous communication
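A minimal sketch of this model (again not from the talk): a hypothetical Channel<T> type, built on a mutex-protected queue, so that two threads interact only by sending and receiving messages.

// Message-passing sketch: threads share nothing except the messages they exchange.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class Channel {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(T msg) {                       // asynchronous send: never blocks
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
    }
    T receive() {                            // blocking receive
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty(); });
        T msg = std::move(q_.front());
        q_.pop();
        return msg;
    }
};

int main() {
    Channel<int> ch;
    std::thread producer([&] { for (int i = 0; i < 5; ++i) ch.send(i); });
    std::thread consumer([&] {
        for (int i = 0; i < 5; ++i)
            std::cout << "received " << ch.receive() << '\n';
    });
    producer.join();
    consumer.join();
}

The send here is asynchronous; a synchronous variant would make send wait until the message has been received.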

Let's see an example for each model
1. Image processing (shared memory)
2. Simple GUI (message passing)

A shared memory-based example: convert from colour to grayscale
Example pixel: R: 204, G: 46, B: 140. (R+G+B) / 3 = 130

A shared memory-based example: convert from colour to grayscale
We parallelize the computation by assigning tiles (pieces) of the image to threads, which execute the conversion in parallel.
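A minimal sketch of that tiling, assuming a flat RGB buffer and the (R+G+B)/3 formula from the previous slide (pixel values and image size are placeholders): each thread converts a disjoint band of rows, so the threads share the buffers but need no locking.

// Parallel grayscale conversion: the image and output are shared,
// each thread writes only its own band of rows.
#include <cstdint>
#include <thread>
#include <vector>

struct Pixel { uint8_t r, g, b; };

int main() {
    const int width = 1024, height = 768, num_threads = 4;
    std::vector<Pixel> image(width * height, Pixel{204, 46, 140});  // placeholder content
    std::vector<uint8_t> gray(width * height);

    auto convert_rows = [&](int first_row, int last_row) {
        for (int y = first_row; y < last_row; ++y)
            for (int x = 0; x < width; ++x) {
                const Pixel& p = image[y * width + x];
                gray[y * width + x] = static_cast<uint8_t>((p.r + p.g + p.b) / 3);
            }
    };

    std::vector<std::thread> workers;
    const int rows_per_thread = height / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        int first = t * rows_per_thread;
        int last = (t == num_threads - 1) ? height : first + rows_per_thread;
        workers.emplace_back(convert_rows, first, last);
    }
    for (auto& w : workers) w.join();
    // gray[0] == 130 for Pixel{204, 46, 140}: (204 + 46 + 140) / 3 = 130
}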

A message passing-based example
- A GUI with 3 widgets:
  - Text area (showing the text "Many operating system designs can be placed into one of two very rough categories, depending upon how they…")
  - Up scroll button
  - Down scroll button
- Must be interactive (immediate feedback)

GUI example implementation: message passing solution
[Diagram: three threads, TextArea, UpButton and DownButton, each implementing one widget and cooperating by exchanging messages.]
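A minimal sketch of this structure (widget and message names are hypothetical): each widget is a thread, only the TextArea thread owns the scroll position, and the buttons influence it purely by sending messages to its mailbox.

// GUI widgets as threads that communicate only via messages.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct Mailbox {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
    void send(const std::string& msg) {
        { std::lock_guard<std::mutex> l(m); q.push(msg); }
        cv.notify_one();
    }
    std::string receive() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        std::string msg = q.front();
        q.pop();
        return msg;
    }
};

int main() {
    Mailbox text_area_inbox;

    // The TextArea thread owns the scroll position; nobody else touches it.
    std::thread text_area([&] {
        int scroll_line = 0;
        for (int handled = 0; handled < 4; ++handled) {
            std::string msg = text_area_inbox.receive();
            if (msg == "scroll_up")   --scroll_line;
            if (msg == "scroll_down") ++scroll_line;
            std::cout << "TextArea now at line " << scroll_line << '\n';
        }
    });

    // Each button is its own thread and communicates only via messages.
    std::thread up_button([&]   { text_area_inbox.send("scroll_up"); });
    std::thread down_button([&] { for (int i = 0; i < 3; ++i) text_area_inbox.send("scroll_down"); });

    up_button.join();
    down_button.join();
    text_area.join();
}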

Conventional wisdom about the characteristics of the two models
1. Performance
2. Programmability

Performance comparison
- Hardware support: extensive for shared memory (all popular architectures); limited for message passing (only special-purpose architectures)
- Data transfer overhead: low for shared memory (cache block management in HW); high for message passing (data replication)
- Access/synchronization overhead: sometimes high for shared memory (critical section contention, NUMA effects); low for message passing (local private memory access)

Programmability comparison
- Communication: implicit for shared memory; explicit for message passing
- Synchronization: explicit for shared memory (locks etc.); implicit for message passing (a side effect of communication)
- Interface (API): read/write shared data structures and mutex primitives vs. send/receive messages and multicast
- Hazards: race conditions, deadlocks, starvation for shared memory; deadlocks, starvation for message passing

Towards the manycore architectures

The manycore era
- Power limits the frequency increase of the processor
- Moore's law: the transistor count keeps doubling every two years
- Replication: increasing number of cores
The graph is from the presentation: Joshi, Ajay, et al. "Building manycore processor-to-DRAM networks using monolithic silicon photonics." High Performance Embedded Computing (HPEC) Workshop.

On the duality of operating system structures
- Operating systems are generally classified as:
  - Message-passing oriented
  - Procedure-oriented (shared memory)
- Each system in one category has a dual in the other category
- Neither model is inherently better than the other (it depends on the machine architecture)
From: Lauer, Hugh C., and Roger M. Needham. "On the duality of operating system structures." ACM SIGOPS Operating Systems Review 13.2 (1979): 3-19.

Non Uniform Memory Access (NUMA)
[Diagram: two sockets (Socket 0, Socket 1) connected by QPI; each socket has several cores with local caches, a shared last-level cache, and its own local RAM domain (Domain 0, Domain 1).]

Cache coherence
[Diagram: Core 0 is about to execute j=i; i++;. Both cores' L1 caches and the LLC hold i=0, in the Shared state.]

Cache coherence
[Diagram: Core 0 writes i=1; its L1 line becomes Modified and an invalidate message on the bus marks Core 1's copy of i Invalid.]

Cache coherence
[Diagram: Core 1 re-reads i; a bus read triggers a write-back from Core 0 and both L1 copies end up holding i=1 in the Shared state.]
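A small sketch, not from the talk, that makes the invalidation traffic shown above measurable: two threads repeatedly write counters that happen to share one cache line, so the line ping-pongs between their private caches exactly as in the diagrams; padding the counters onto separate lines removes that traffic. The timings it prints are illustrative only.

// Coherence ping-pong demo: same cache line vs. padded counters.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct SharedLine  { std::atomic<long> a{0}; std::atomic<long> b{0}; };          // one line
struct PaddedLines { alignas(64) std::atomic<long> a{0};
                     alignas(64) std::atomic<long> b{0}; };                       // separate lines

template <typename Counters>
double run(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < 10000000; ++i) c.a++; });
    std::thread t2([&] { for (long i = 0; i < 10000000; ++i) c.b++; });
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    SharedLine shared;   // both counters in one cache line: invalidation ping-pong
    PaddedLines padded;  // one counter per cache line: no coherence traffic between them
    std::cout << "same cache line:      " << run(shared) << " s\n";
    std::cout << "separate cache lines: " << run(padded) << " s\n";
}

On a typical multicore machine the padded version runs noticeably faster, which is the per-write coherence cost that the experiment on the next slides quantifies.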

A key question
When updating shared state, which approach is more expensive (in terms of latency): shared memory or message passing?

An experiment of shared memory vs message passing performance
Updating shared state of size 1 to 8 cache lines, relying on cache-coherent shared memory, on a 4x4-core AMD system.
[Diagram: threads on every core update the shared state directly through their caches, across the bus and HyperTransport interconnect.]

An experiment of shared memory vs message passing performance
Updating shared state of size 1 to 8 cache lines, relying on synchronous lightweight remote procedure calls (message passing).
[Diagram: a dedicated server core (S) updates the shared state on behalf of the client threads, which send it requests.]
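A structural sketch of the two set-ups, hypothetical code rather than the paper's benchmark: clients either update the shared state directly under a lock, or send requests to a single server thread that owns the state. For brevity the request queue here is itself lock-protected, whereas the real experiment uses per-core message channels sized in cache lines.

// Two ways to update shared state: direct writes vs. a serving thread.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

const int kClients = 4;
const int kUpdatesPerClient = 10000;

int main() {
    // Variant 1: shared memory. Every client writes the state directly,
    // so the cache line holding it migrates between the clients' cores.
    {
        long shared_state = 0;
        std::mutex m;
        std::vector<std::thread> clients;
        for (int c = 0; c < kClients; ++c)
            clients.emplace_back([&] {
                for (int i = 0; i < kUpdatesPerClient; ++i) {
                    std::lock_guard<std::mutex> lock(m);
                    ++shared_state;
                }
            });
        for (auto& t : clients) t.join();
        std::cout << "shared-memory result:   " << shared_state << '\n';
    }

    // Variant 2: message passing. Only the server thread ("S" in the slide)
    // touches the state; clients just enqueue update requests.
    {
        long shared_state = 0;
        std::queue<int> requests;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        std::thread server([&] {
            std::unique_lock<std::mutex> lock(m);
            for (;;) {
                cv.wait(lock, [&] { return !requests.empty() || done; });
                while (!requests.empty()) { requests.pop(); ++shared_state; }
                if (done) break;
            }
        });

        std::vector<std::thread> clients;
        for (int c = 0; c < kClients; ++c)
            clients.emplace_back([&] {
                for (int i = 0; i < kUpdatesPerClient; ++i) {
                    { std::lock_guard<std::mutex> lock(m); requests.push(1); }
                    cv.notify_one();
                }
            });
        for (auto& t : clients) t.join();
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
        server.join();
        std::cout << "message-passing result: " << shared_state << '\n';
    }
}

The point of the second variant is that the state stays resident in the server core's cache; only small request messages move between cores, which is what the plot on the next slide shows scaling better.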

Messages scale better than shared memory
Message passing scales better than shared memory when increasing the core count and the size of the shared state.
The plot is adapted from: Baumann, Andrew, et al. "The multikernel: a new OS architecture for scalable multicore systems." Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM, 2009.

…some other hints that may lead to further fragmentation of coherency domains

Increasing heterogeneity of computing platforms
[Diagram: heterogeneity grows over time, from single cores with concurrent OSes and coroutines, to dual cores with pthreads, to multi-socket machines, and on to manycores, GPUs, coprocessors and FPGAs.]
- Message passing: fundamental for communication in a heterogeneous environment
- Shared memory: hard to implement in a heterogeneous environment

Message passing OS vs shared memory OS
[Diagram: the Barrelfish OS (message passing) runs an OS node with a state replica and architecture-dependent code on each core of a heterogeneous machine (x86, ARM, GPU, …), keeping the replicas consistent with asynchronous messages. The Linux OS (shared memory) runs a single kernel with drivers and applications over a single architecture (x86, etc.) plus a GPU.]
Adapted from: Baumann, Andrew, et al. "The multikernel: a new OS architecture for scalable multicore systems." Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM, 2009.

What to expect?

Emerging concurrency paradigms
New high-level paradigms are being developed, based on shared memory and/or message passing constructs:
- Asynchronous tasks (futures/promises)
- Partitioned Global Address Space (PGAS) languages/libraries
- Actor model
- Functional concurrency
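As one concrete illustration of the first item on the list, a small C++11 sketch of asynchronous tasks with futures: the caller launches the work asynchronously and blocks only when it actually needs the result.

// Futures/promises sketch: launch a task, get its result through a future.
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);

    // Launch the reduction as an asynchronous task; the result is a future.
    std::future<long> sum = std::async(std::launch::async, [&] {
        return std::accumulate(data.begin(), data.end(), 0L);
    });

    // The caller can do other work here, then block only when the value is needed.
    std::cout << "sum = " << sum.get() << '\n';   // prints 1000000
}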

The future perspective
- Communication is the key:
  - For energy efficiency
  - For runtime performance
  - To manage software complexity
  - To manage hardware heterogeneity
- Innovation in the hardware sector pressures systems software engineers to develop appropriate support:
  - At the operating system level
  - At the level of concurrent programming frameworks
  - Communication-oriented tools and techniques to design, implement and analyse concurrent programs


References
- Baumann, Andrew, et al. "The multikernel: a new OS architecture for scalable multicore systems." Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM, 2009.
- Gerber, et al. "Not Your Parents' Physical Address Space." Proceedings of the 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015.
- Sorin, Daniel J., Mark D. Hill, and David A. Wood. A Primer on Memory Consistency and Cache Coherence. Morgan & Claypool Publishers, 2011.
- Martin, Milo M. K., Mark D. Hill, and Daniel J. Sorin. "Why on-chip cache coherence is here to stay." Communications of the ACM 55.7 (2012).
- Hansen, Per Brinch. "The invention of concurrent programming." The Origin of Concurrent Programming. Springer New York, 2002.
- Butcher, Paul. Seven Concurrency Models in Seven Weeks: When Threads Unravel. Pragmatic Bookshelf, 2014.
- Hoare, Charles Antony Richard. "Communicating sequential processes." Communications of the ACM 21.8 (1978): 666-677.

Thank you for your attention
Many thanks to my supporters and mentors for this presentation:
- Sebastian Lopienski, CERN
- Andreas Joachim Peters, CERN
- Andreas Hirstius, Intel GmbH
- Spyros Lalis, University of Thessaly
This work is supported by the Marie Curie Early European Industrial Doctorates Fellowship of the European Community's Seventh Framework Programme under contract number (PITN-GA ICE-DIP).