A Coherent and Managed Runtime for ML on the SCC
KC Sivaramakrishnan (Purdue University), Lukasz Ziarek (SUNY Buffalo), Suresh Jagannathan (Purdue University)

Big Picture

Like a cluster of machines, the Intel SCC has no cache coherence, offers message passing buffers, and is programmed with distributed-programming interfaces (RCCE, MPI, TCP/IP). Unlike a cluster, it also has shared memory with Software Managed Cache-Coherence (SMC). A cache-coherent machine, by contrast, requires no change to the programming model and supports automatic memory management. Can we program the SCC as a cache coherent machine?

Intel SCC Architecture

48 cores (core 0 .. core 47), each with its own private memory. The off-chip shared memory is not cache-coherent: caching is either disabled entirely or software-managed (SMC, providing release consistency). On-die message passing buffers (MPB) provide 8KB per core. How do we provide an efficient cache coherence layer?

SMP Programming Model for SCC

An abstraction layer between the program and the SCC should provide:
– A single address space
– Cache coherence
– Sequential consistency
– Automatic memory management
– Use of the MPB for inter-core communication

Our abstraction layer is the MultiMLton VM, which contributes:
1. A new GC that provides a coherent and managed global address space
2. A mapping of first-class channel communication onto the MPB

Programming Model

MultiMLton – safety, scalability, and readiness for future manycore processors. It is a parallel extension of MLton, a whole-program, optimizing Standard ML compiler. Immutability is the default; mutations are explicit. ACML is its first-class message-passing language: one thread performs send (c, v) and another performs v = recv (c) over a channel c, with automatic memory management throughout.

Coherent and Managed Address Space

Requirements

1. Single global address space
2. Memory consistency
3. Independent core-local GC

A thread-local GC (variously called private-nursery GC, local-heap GC, on-the-fly GC, or thread-specific-heap GC) fits these requirements.

Thread-local GC for SCC

Each core (core 0 .. core 47) keeps a local heap in its private memory, with caching enabled. The off-chip shared memory is split into an uncached shared heap holding mutable objects (caching disabled) and a cached shared heap holding immutable objects (SMC, caching enabled). Consistency preservation: no inter-coherence-domain pointers! Local heaps are collected independently.
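The heap split above means the runtime must be able to tell, from an address alone, which coherence domain an object lives in. The following sketch illustrates that test; the address-range constants and all names are hypothetical, not the actual MultiMLton layout:

```c
#include <stdint.h>

/* Hypothetical address-space layout, for illustration only: each core's
   local heap lives in private memory, while the shared memory is split
   into an uncached region (mutable objects) and an SMC-cached region
   (immutable objects). */
#define LOCAL_HEAP_BASE   0x00100000u
#define LOCAL_HEAP_END    0x04000000u
#define UNCACHED_SH_BASE  0x80000000u
#define UNCACHED_SH_END   0x90000000u
#define CACHED_SH_BASE    0x90000000u
#define CACHED_SH_END     0xA0000000u

typedef enum {
  HEAP_LOCAL, HEAP_UNCACHED_SHARED, HEAP_CACHED_SHARED, HEAP_NONE
} heap_kind;

/* Classify an address by the heap it falls in; the runtime's read and
   write barriers branch on exactly this kind of range test. */
static heap_kind classifyAddr(uint32_t addr) {
  if (addr >= LOCAL_HEAP_BASE && addr < LOCAL_HEAP_END)
    return HEAP_LOCAL;
  if (addr >= UNCACHED_SH_BASE && addr < UNCACHED_SH_END)
    return HEAP_UNCACHED_SHARED;
  if (addr >= CACHED_SH_BASE && addr < CACHED_SH_END)
    return HEAP_CACHED_SHARED;
  return HEAP_NONE;
}
```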

Heap Invariant Preservation

Consider a mutable ref r in the uncached shared heap and a mutable object x in a local heap, where x has an immutable field y. The assignment r := x would create a pointer from the shared heap into a local heap, violating the invariant. The write barrier therefore globalizes x: the mutable x is copied to the uncached shared heap, the immutable y to the cached shared heap, and a forwarding pointer (FWD) is left in x's place. Since local pointers may now lead to forwarded objects, the mutator needs read barriers!
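Globalization can be modeled in miniature. This sketch copies an object into a stand-in shared heap, leaves a forwarding pointer behind, and shows the check that the read barrier later performs; the object layout and all names are illustrative, not MultiMLton's actual representation:

```c
#include <stdlib.h>
#include <string.h>

#define TAG_NORMAL    0
#define TAG_FORWARDED 1

typedef struct object {
  int tag;                 /* TAG_NORMAL or TAG_FORWARDED */
  struct object *forward;  /* valid only when tag == TAG_FORWARDED */
  int payload;
} object;

/* Copy obj into (a stand-in for) the shared heap and turn the original
   into a forwarding pointer, as globalization does. */
static object *globalize(object *obj) {
  object *copy = malloc(sizeof *copy);
  memcpy(copy, obj, sizeof *copy);
  obj->tag = TAG_FORWARDED;
  obj->forward = copy;
  return copy;
}

/* The essence of the read barrier: chase the forwarding pointer if the
   object has been globalized. */
static object *follow(object *obj) {
  return obj->tag == TAG_FORWARDED ? obj->forward : obj;
}
```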

Maintaining Consistency

Local heap objects are not shared, by definition. The uncached shared heap is consistent by construction. The cached shared heap (CSH) uses SMC: invalidation and flushing have to be managed by the runtime, and it is unwise to invalidate before every CSH read and to flush after every CSH write. Solution: exploit the key observation that the CSH only stores immutable objects!

Ensuring Consistency (Reads)

Each core maintains MAX_CSH_ADDR, the highest CSH address it has observed. On a read of an address x > MAX_CSH_ADDR, the core invalidates its cache with smcAcquire() and updates MAX_CSH_ADDR to x. Values at addresses below MAX_CSH_ADDR are assumed up to date; values beyond it may be stale. Cache invalidation is thus combined with the read barrier.

Ensuring Consistency (Reads)

There is no need to invalidate before a read of y where y < MAX_CSH_ADDR. Why?
1. The CSH uses bump-pointer allocation
2. All objects in the CSH are immutable
Hence y < MAX_CSH_ADDR implies the cache was invalidated after y was created, so the cached value is current.
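The read-side protocol above can be sketched as follows. MAX_CSH_ADDR and smcAcquire mirror the names used in the slides; the stub body and the acquire counter are added purely to make the behavior observable:

```c
#include <stdint.h>

/* Per-core consistency state: the highest cached-shared-heap address
   this core has observed. */
static uint32_t MAX_CSH_ADDR = 0;
static int acquireCount = 0;       /* counts invalidations, for testing */

/* Stub for the SMC cache-invalidation primitive. */
static void smcAcquire(void) { acquireCount++; }

/* Read-side consistency check: because the CSH uses bump-pointer
   allocation and holds only immutable objects, an address at or below
   MAX_CSH_ADDR was allocated before the last invalidation and is
   guaranteed up to date; only reads beyond it force an invalidation. */
static void cshReadCheck(uint32_t addr) {
  if (addr > MAX_CSH_ADDR) {
    smcAcquire();
    MAX_CSH_ADDR = addr;
  }
}
```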

Ensuring Consistency (Writes)

Writes to the shared heap occur only during globalization. The cache is flushed after globalization with smcRelease(). The cache flush is thus combined with the write barrier.

Garbage Collection

Local heaps are collected independently! Shared heaps are collected after stopping all of the cores:
– The collection proceeds in SPMD fashion
– Each core prepares its shared-heap reachable set independently
– One core then collects the shared heap
Sansom's dual-mode GC is a good fit for the SCC!
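The two-phase shared-heap collection can be caricatured sequentially: each core contributes the shared-heap objects reachable from its roots, then a single core processes the result. The core count, object count, and root tables below are invented for illustration:

```c
#include <stdbool.h>

#define NCORES 4
#define NOBJS  8

/* Mark bits for the NOBJS shared-heap objects. */
static bool marked[NOBJS];

/* rootsOf[c] lists the shared-heap objects reachable from core c's
   local roots, terminated by -1 (made-up data). */
static const int rootsOf[NCORES][NOBJS] = {
  {0, 3, -1}, {3, 5, -1}, {-1}, {7, -1}
};

/* Phase 1 (SPMD): each core independently marks its reachable set. */
static void markPhase(void) {
  for (int c = 0; c < NCORES; c++)
    for (int i = 0; i < NOBJS && rootsOf[c][i] != -1; i++)
      marked[rootsOf[c][i]] = true;
}

/* Phase 2: one core collects; here we just count the survivors. */
static int sweepPhase(void) {
  int live = 0;
  for (int o = 0; o < NOBJS; o++)
    if (marked[o]) live++;
  return live;
}
```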

GC Evaluation

Across 8 MultiMLton benchmarks, the memory access profile is 89% local heap, 10% cached shared heap, and 1% uncached shared heap: almost all accesses are cacheable! 48% faster.

ACML Channels on MPB

Channel Implementation

Challenges:
– Channels are first-class objects
– Multiple senders/receivers can share the same channel
– Channels are unbounded
– Both synchronous and asynchronous: a blocked communication must not block the core

datatype 'a chan = {sendQ : ('a * unit thread) Q.t,
                    recvQ : ('a thread) Q.t}
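A sequential sketch of the queue-based channel representation above: a sender either matches a blocked receiver or enqueues itself, and symmetrically for receivers. The names and fixed-size arrays are illustrative; the real implementation uses unbounded queues of suspended user-level threads:

```c
#define QMAX 16

typedef struct { int val; int threadId; } blockedSend;

typedef struct {
  blockedSend sendQ[QMAX]; int nSend;  /* blocked senders (value + tid) */
  int recvQ[QMAX];         int nRecv;  /* blocked receiver thread ids   */
} chan;

/* send: if a receiver is blocked, match with it (return its id so it
   can be resumed); otherwise enqueue and block (return -1). */
static int chanSend(chan *c, int val, int tid) {
  if (c->nRecv > 0)
    return c->recvQ[--c->nRecv];
  c->sendQ[c->nSend].val = val;
  c->sendQ[c->nSend].threadId = tid;
  c->nSend++;
  return -1;
}

/* recv: if a sender is blocked, take its value; otherwise enqueue
   and block (return -1). */
static int chanRecv(chan *c, int tid) {
  if (c->nSend > 0)
    return c->sendQ[--c->nSend].val;
  c->recvQ[c->nRecv++] = tid;
  return -1;
}
```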

Specializing Channel Communication

Specialize on message mutability:
– Mutable messages must be globalized, to maintain consistency
– Immutable messages can utilize the MPB

Sender Blocks

Case: the channel c is in the shared heap; the message is immutable and in the local heap. Thread t1 on core 0 executes send(c, v). With no receiver waiting, t1 blocks on the channel, and the immutable value v remains in core 0's local heap.

Receiver Interrupts

Thread t2 on core 1 executes recv(c). Core 1 interrupts core 0 to initiate the transfer of v over the MPB; v is copied into core 1's local heap, and both threads resume.

Message Passing Evaluation

On 48 cores, the MPB implementation is only 9% faster than the shared-memory (SHM) implementation. Inter-core interrupts are expensive (context switches, idling cores, small messages), and polling is not an option due to user-level threading.

Conclusion

A cache coherent runtime for ML on the SCC:
– Thread-local GC: single address space, cache coherence, concurrent collections; most memory accesses are cacheable
– Channel communication over the MPB: inter-core interrupts are expensive

Questions?

Read Barrier

pointer readBarrier (pointer p) {
  if (getHeader (p) == FORWARDED) { /* a globalized object */
    p = *(pointer*)p;
    if (p > MAX_CSH_ADDR) {
      smcAcquire ();
      MAX_CSH_ADDR = p;
    }
  }
  return p;
}

Write Barrier

Val writeBarrier (Ref r, Val v) {
  if (isObjptr (v) && isInSharedHeap (r) && isInLocalHeap (v)) {
    v = globalize (v);
    smcRelease ();
  }
  return v;
}

Case 4

The channel is in the shared heap; the message is mutable and in the local heap. On t1: send(c, v), the mutable message v is globalized into the shared heap.

Case 2

The channel is in the shared heap; the message is a primitive value. On t1: send(c, 0), no globalization is needed, and the reference from the shared-heap channel to the blocked local thread t1 is recorded in a remembered set.

Case 3 – Sender Blocks

The channel and the message v are both in the shared heap. Thread t1 on core 0 executes send(c, v); with no receiver waiting, t1 blocks, and the reference to the blocked local thread is recorded in a remembered set.

Case 3 – Receiver Unblocks

Thread t2 on core 1 executes recv(c), takes the shared-heap value v directly (no transfer over the MPB is needed), and unblocks t1.