vNUMA: Virtual Multiprocessors on Clusters of Workstations
Matthew Chapman
University of New South Wales
matthewc@cse.unsw.edu.au
Parallel computing
Some problems can be solved faster with more than one processor.
Symmetric multiprocessors (SMP)
Commodity hardware
Low communication overhead
Limited number of CPUs
Large shared-memory systems (NUMA)
Natural extension of SMP
Ease of use and programming
Cost
Clusters of workstations (COW)
Cost
Complexity

"If you were ploughing a field, which would you rather use? Two strong oxen or 1024 chickens?" (Seymour Cray)
Addressing the complexity
Middleware
Distributed operating systems
Virtualization?
Virtualization
Usually: multiple virtual computers on a single computer
Here, the inverse: a single virtual NUMA system on a cluster?
vNUMA
Cluster nodes connected by Gigabit Ethernet (or better)
The vNUMA layer presents them as one virtual machine with disk, network and console
vNUMA
Thin hypervisor
Booted from ELILO
Discovers other resources on the network
Distributed Shared Memory (DSM)
Simulate shared memory using virtual memory mechanisms
Page granularity (4KB)
Example: node 0 holds page X read-write (rw); node 1 has no mapping. Node 1's read of X page-faults, the page is fetched from node 0, both nodes downgrade to read-only (r), and the read then succeeds on the local copy.
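The following is a minimal user-level sketch of the same mechanism, assuming a hypothetical fetch_page_from_owner() that copies page contents from the owning node over the network; vNUMA implements the equivalent inside the hypervisor, transparently to the guest OS.

```c
/* Sketch of page-granularity DSM via virtual memory protection.
 * fetch_page_from_owner() is a hypothetical network helper. */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

extern void fetch_page_from_owner(void *page, void *buf);  /* hypothetical */

static void dsm_fault_handler(int sig, siginfo_t *si, void *ctx)
{
    /* Round the faulting address down to its 4KB page. */
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));
    char buf[PAGE_SIZE];
    (void)sig; (void)ctx;

    fetch_page_from_owner(page, buf);                   /* get current contents */
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    memcpy(page, buf, PAGE_SIZE);                       /* install local copy   */
    mprotect(page, PAGE_SIZE, PROT_READ);               /* read-only: a later
                                                           write faults again   */
}

int dsm_init(void *region, size_t len)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = dsm_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) < 0)
        return -1;
    /* Start with no access: the first touch of any page traps. */
    return mprotect(region, len, PROT_NONE);
}
```

Starting with no access means the first touch of any page traps into the handler; mapping fetched pages read-only means a subsequent write faults again, at which point remote copies can be invalidated (see the fault-handling slide).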
Memory management in an OS
Virtual address → (mapping p1, controlled by OS) → physical address
Memory management in vNUMA
Virtual address → (p1, controlled by OS) → physical address → (p2, controlled by vNUMA) → machine address
Effective mapping in TLB: permissions p1 ∩ p2
Example: the OS grants rw but vNUMA grants only r, so the effective TLB entry is r only, and a write traps into vNUMA's DSM layer
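A minimal sketch of the permission intersection, using illustrative structures (guest_pte, dsm_entry) that are not vNUMA's actual data types:

```c
/* Two-level translation: the guest OS's page-table permissions (p1)
 * are combined with the DSM layer's permissions for the backing
 * machine page (p2) when a TLB entry is inserted. */
#include <stdint.h>

enum perm { PERM_NONE = 0, PERM_R = 1, PERM_W = 2, PERM_X = 4 };

struct guest_pte { uint64_t pfn; unsigned p1; }; /* virtual -> physical, set by OS    */
struct dsm_entry { uint64_t mfn; unsigned p2; }; /* physical -> machine, set by vNUMA */

/* Effective TLB permissions: only what both levels allow.
 * e.g. (PERM_R | PERM_W) & PERM_R == PERM_R: writes fault to vNUMA. */
static unsigned effective_perms(const struct guest_pte *g, const struct dsm_entry *d)
{
    return g->p1 & d->p2;
}
```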
Page fault handling (simplified)
1. Fault due to OS permissions? Yes: deliver the fault to the OS.
2. Page owned by another node? Yes: fetch the page from its owner.
3. Access for write? Yes: invalidate other copies.
4. Allow access.
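The decision chain maps directly onto code. In this sketch every helper is a hypothetical stand-in for hypervisor internals:

```c
/* Sketch of the simplified fault-handling chain above. */
struct fault { unsigned long addr; int is_write; };

extern int  fault_matches_os_permissions(const struct fault *f); /* hypothetical */
extern void deliver_fault_to_guest_os(const struct fault *f);    /* hypothetical */
extern int  page_owned_by_remote_node(unsigned long addr);       /* hypothetical */
extern void fetch_remote_page(unsigned long addr);               /* hypothetical */
extern void invalidate_remote_copies(unsigned long addr);        /* hypothetical */
extern void allow_access(const struct fault *f);                 /* hypothetical */

void handle_page_fault(const struct fault *f)
{
    /* 1. The guest OS itself forbade this access: reflect the fault. */
    if (fault_matches_os_permissions(f)) {
        deliver_fault_to_guest_os(f);
        return;
    }
    /* 2. Page lives elsewhere: bring a copy local. */
    if (page_owned_by_remote_node(f->addr))
        fetch_remote_page(f->addr);

    /* 3. Writes need exclusive ownership: invalidate other copies. */
    if (f->is_write)
        invalidate_remote_copies(f->addr);

    /* 4. Install/upgrade the mapping and resume the guest. */
    allow_access(f);
}
```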
Optimisation: Write updates
Transmit individual writes instead of page-level invalidations in some cases
Writes are queued and sent in batches
Example: both nodes hold X read-only (r). Node 0's write page-faults, the update (X = Y) is queued for transmission to node 1, and node 0 continues immediately; node 1's copy stays valid.
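A sketch of the batching side of this optimisation; the queue layout and broadcast_write_batch() are illustrative assumptions, not vNUMA's wire protocol:

```c
/* Queue individual word writes and flush them to peers in batches,
 * instead of invalidating remote copies on every store. */
#include <stddef.h>
#include <stdint.h>

struct write_update { uint64_t machine_addr; uint64_t value; };

#define BATCH_MAX 64

static struct write_update queue[BATCH_MAX];
static size_t queued;

extern void broadcast_write_batch(const struct write_update *w, size_t n); /* hypothetical */

/* Called when a store hits a page kept in write-update mode. */
void record_write(uint64_t machine_addr, uint64_t value)
{
    queue[queued].machine_addr = machine_addr;
    queue[queued].value = value;
    if (++queued == BATCH_MAX) {   /* batch full: push to peers */
        broadcast_write_batch(queue, queued);
        queued = 0;
    }
}

/* Flushed at synchronisation points (e.g. releasing a lock) so other
 * nodes observe the queued writes in time. */
void flush_writes(void)
{
    if (queued) {
        broadcast_write_batch(queue, queued);
        queued = 0;
    }
}
```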
Optimisation: Critical word first
Remote reads return the requested word first
Allows a faster restart of the faulting instruction
Example: node 1's read of X page-faults; the owner replies with the word at X first so node 1 can continue, then streams the rest of the page data; both nodes end up with r permissions.
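A sketch of the owner's reply path under this optimisation; the send_*() helpers are hypothetical network primitives:

```c
/* Critical word first: send the word covering the faulting address
 * before the rest of the page, so the requester can restart its
 * faulting load immediately. */
#include <stdint.h>

#define PAGE_SIZE 4096
#define WORD_SIZE 8

extern void send_word(int node, uint64_t addr, const void *word);          /* hypothetical */
extern void send_page_remainder(int node, const void *page, uint64_t off); /* hypothetical */

/* Owner-side reply to a remote read fault at fault_addr. */
void serve_read_fault(int requester, const uint8_t *page, uint64_t fault_addr)
{
    /* Word-aligned offset of the faulting address within its page. */
    uint64_t off = (fault_addr & (PAGE_SIZE - 1)) & ~(uint64_t)(WORD_SIZE - 1);

    /* 1. Critical word first: the requester can resume its load now. */
    send_word(requester, fault_addr, page + off);

    /* 2. Then stream the rest of the page to complete the local copy. */
    send_page_remainder(requester, page, off);
}
```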
Performance: Scientific applications
SPLASH-2 Water-N2: application overhead is small
[Plot: time (s) vs. problem size]
Performance: Scientific applications
SPLASH-2 Barnes: larger overhead
[Plot: time (s) vs. problem size]
Performance: Parallel compile
Severe negative scaling relative to the 1-node case!
Profiling tools: vnumaviz
[Plot: fault address (virtual address) vs. time, Barnes application]
Profiling tools: vnumaprof
Compile benchmark profile (r = read, x = execute, w = write, i = invalidate, s = semaphore):
Total DSM overhead: 280396.830 ms in 211417r/1825x/135040w/106995i/371367s
  User:       16231.443 ms in 22677r/1823x/8599w/5793i/287s (5.7%)
  Kernel:    264165.387 ms in 188740r/2x/126441w/101202i/371080s (94.2%)
  Read:      106757.495 ms in 211417r/0x/0w/0i/0s (38.1%)
  Execute:      828.677 ms in 0r/1825x/0w/0i/0s (0.3%)
  Write:      44071.038 ms in 0r/0x/135040w/0i/0s (15.7%)
  Invalidate: 31117.793 ms in 0r/0x/0w/106995i/0s (11.1%)
  Semaphore:  97621.827 ms in 0r/0x/0w/0i/371367s (34.8%)
@ 0xa0000001004d19a0: 30362.010 ms in 0r/0x/13684w/8060i/88400s in __raw_spin_lock_flags() at include/asm/spinlock.h:78
@ 0xa0000001002eae30:  7889.798 ms in 0r/0x/17774w/4268i/0s in clear_page()
@ 0xa000000100100d10:  6726.483 ms in 0r/0x/2091w/2024i/22259s in atomic_add_negative() at include/asm/atomic.h:135
@ 0xa0000001002dd940:  6315.203 ms in 0r/0x/685w/916i/20690s in _atomic_dec_and_lock() at lib/dec_and_lock.c:24
@ 0xa0000001000ce370:  5359.691 ms in 0r/0x/2582w/2605i/17701s in find_get_page() at include/linux/mm.h:332
@ 0xa0000001004d2a90:  4411.197 ms in 0r/0x/1484w/924i/12376s in lock_kernel() at include/asm/spinlock.h:78
Improving performance
Optimised locks (sub-page granularity)
Better I/O distribution
More virtual CPUs than physical CPUs (multithreading)
Faster interconnect (InfiniBand?)
Future possibilities
SMP support
Other architectures
Fault tolerance
Questions?