Erlang Multicore support


Erlang Multicore support: Behind the scenes

Erlang VM (BEAM) when we started

- A virtual register machine that scheduled lightweight processes
- One single process scheduler, with one run queue per priority level
- Execution switched between I/O operations and process scheduling
- I/O drivers and "built-in functions" (native functions) had exclusive access to the VM's data structures:
  - Network code
  - ETS tables
  - Process inspection, etc.
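The single-scheduler model above can be sketched in C (the VM's implementation language). This is a minimal illustration under assumed names, not the actual BEAM code: one run queue per priority level, and the scheduler always drains the highest non-empty priority first.

```c
/* Sketch of the original single-scheduler design: one run queue
 * per priority level; the scheduler pops from the highest
 * non-empty priority. All names here are illustrative. */
#include <stddef.h>

enum prio { PRIO_MAX, PRIO_HIGH, PRIO_NORMAL, PRIO_LOW, PRIO_COUNT };

typedef struct proc { int id; struct proc *next; } proc_t;

typedef struct {
    proc_t *head[PRIO_COUNT];
    proc_t *tail[PRIO_COUNT];
} run_queue_t;

static void enqueue(run_queue_t *rq, proc_t *p, enum prio pr) {
    p->next = NULL;
    if (rq->tail[pr]) rq->tail[pr]->next = p; else rq->head[pr] = p;
    rq->tail[pr] = p;
}

/* The single scheduler: take the first process from the highest
 * non-empty priority queue, or NULL if nothing is runnable. */
static proc_t *dequeue(run_queue_t *rq) {
    for (int pr = 0; pr < PRIO_COUNT; pr++) {
        proc_t *p = rq->head[pr];
        if (p) {
            rq->head[pr] = p->next;
            if (!rq->head[pr]) rq->tail[pr] = NULL;
            return p;
        }
    }
    return NULL; /* idle */
}
```

In the pre-SMP VM one such queue structure, plus the I/O polling loop, was the whole story: no locks were needed because only one thread ever touched it.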

A perfect kind of program for multicore

- A lot of small units of execution
- The parallel mindset has created applications just waiting to be spread over several physical cores

(Diagram: the same set of processes running on a single core vs. spread over multiple cores.)

Conversion steps

1. Multiple schedulers
2. Parallel I/O
3. Parallel memory allocation
4. Multiple run queues and generally less global locking

Multiple schedulers

Tools
- Locking order and lock-checker
- Ordinary test cases
- Benchmarks (synthetic)

Techniques
- Our own thread library (Uppsala University)
- Lock tables and a custom lock implementation for processes
- Lots of conventional mutexes

Result
- One scheduler per logical core

Insights
- You will have to make memory/speed tradeoffs
- Lock-order enforcement is very helpful
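The lock-order enforcement mentioned above can be illustrated with a hypothetical debug wrapper: every lock is assigned a fixed rank, and a thread may only acquire a lock of strictly higher rank than any lock it already holds, which rules out the classic AB/BA deadlock. The single-rank bookkeeping here is a simplification for the sketch; the VM's actual lock-checker is more elaborate.

```c
/* Hypothetical lock-order checker: acquiring locks out of their
 * global rank order trips the assertion in debug builds. */
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mtx;
    int rank;                 /* position in the global lock order */
} ordered_lock_t;

/* Highest rank currently held by this thread (-1 = none). */
static __thread int highest_held_rank = -1;

static void ordered_lock(ordered_lock_t *l) {
    assert(l->rank > highest_held_rank && "lock order violation");
    pthread_mutex_lock(&l->mtx);
    highest_held_rank = l->rank;
}

/* Caller passes the rank it held before taking this lock. */
static void ordered_unlock(ordered_lock_t *l, int prev_rank) {
    pthread_mutex_unlock(&l->mtx);
    highest_held_rank = prev_rank;
}
```

Because the check is a plain assertion, it costs nothing in release builds but turns a latent deadlock into an immediate, reproducible failure during testing.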

Parallel I/O

Tools
- More simple benchmarks
- Customer systems
- Intuition (or: the problem was obvious...)

Techniques
- More fine-grained locking
- Locking at different levels, depending on the I/O driver implementation
- Scheduling of I/O operations

Result
- Real applications run in parallel...

Insight
- Doing things at the right time can vastly reduce complexity
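"Locking at different levels depending on the driver" can be sketched as follows; the names and the two-level scheme are assumptions for illustration, not the VM's actual driver API. A driver known to be thread-safe can be locked per port, so its ports do I/O in parallel, while a legacy driver keeps one lock over all of its ports.

```c
/* Illustrative two-level I/O locking: the lock protecting an I/O
 * operation depends on what the driver declares it can tolerate. */
#include <pthread.h>

enum lock_level { LOCK_PER_DRIVER, LOCK_PER_PORT };

typedef struct {
    enum lock_level level;
    pthread_mutex_t driver_lock;   /* used when LOCK_PER_DRIVER */
} driver_t;

typedef struct {
    driver_t *drv;
    pthread_mutex_t port_lock;     /* used when LOCK_PER_PORT */
} port_t;

static pthread_mutex_t *io_lock_for(port_t *p) {
    return p->drv->level == LOCK_PER_PORT ? &p->port_lock
                                          : &p->drv->driver_lock;
}

static void run_io(port_t *p, void (*op)(port_t *)) {
    pthread_mutex_t *m = io_lock_for(p);
    pthread_mutex_lock(m);
    op(p);                         /* driver callback under the lock */
    pthread_mutex_unlock(m);
}
```

The payoff is that a safe driver's ports contend only with themselves, while unsafe drivers still get correctness by default.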

Multiple allocators

Tools
- Even more benchmarks
- VTune (Intel-specific)
- Thread Profiler (Intel-specific)

Techniques
- Each scheduler has its own instance of the memory allocators
- The "malloc" implementation was already our own
- Locks are still needed, as one scheduler might free another scheduler's memory

Result
- Greatly improved performance for CPU-intensive applications

Insight
- Not only execution has to be distributed over the cores
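The per-scheduler allocator instances, and why a lock is still needed, can be sketched like this (an assumed structure, not erts_alloc itself; one size class, for brevity). Each scheduler allocates from its own instance, so in the common case there is no cross-core contention, but the free list still needs a lock because a block can be freed by a scheduler other than the one that allocated it.

```c
/* Sketch of per-scheduler allocator instances with remote frees. */
#include <pthread.h>
#include <stdlib.h>

typedef struct block { struct block *next; } block_t;

typedef struct {
    pthread_mutex_t lock;   /* protects free_list from remote frees */
    block_t *free_list;
} sched_alloc_t;

/* Called only by the owning scheduler thread. */
static void *sched_alloc(sched_alloc_t *a, size_t sz) {
    pthread_mutex_lock(&a->lock);
    block_t *b = a->free_list;
    if (b) a->free_list = b->next;
    pthread_mutex_unlock(&a->lock);
    if (b) return b;        /* recycled block (one size class assumed) */
    return malloc(sz < sizeof(block_t) ? sizeof(block_t) : sz);
}

/* May be called by ANY scheduler: the block returns to its owner. */
static void sched_free(sched_alloc_t *owner, void *p) {
    block_t *b = p;
    pthread_mutex_lock(&owner->lock);
    b->next = owner->free_list;
    owner->free_list = b;
    pthread_mutex_unlock(&owner->lock);
}
```

Since the owner is almost always the only thread touching its instance, the lock is nearly uncontended; it exists purely for the cross-scheduler free case the slide mentions.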

Multiple run queues and generally less global locking

Tools
- Custom lock-counting implementation (not CPU-specific)
- More massive multicore CPUs to test on (Tilera, Nehalem)
- More customer code from more projects

Techniques
- Distributing data over the schedulers
- Load balancing at certain points
- More fine-grained locking (ETS meta and shared tables)
- Reimplementation of distribution marshaling to remove the need for sequential encode/decode

Results
- Far better performance on massive multicore systems
- Nehalem performance is great, but Core 2 is still problematic

Insight
- No global lock will ever fail to create a bottleneck

Multiple run queues

(Diagram: the Erlang SMP VM (R13), with schedulers #1, #2, #3 each owning a private run queue, connected by the migration logic.)

Migration logic

- Keep the schedulers equally loaded
- Load balancing runs after a specific number of reductions:
  - Calculate an average run-queue length
  - Set up migration paths
  - Each scheduler migrates jobs until happy
- If a scheduler suddenly has no job, it starts stealing work
- If, at load balancing, the logic detects too little work for the schedulers, some will be temporarily inactivated
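The balancing and work-stealing steps above can be sketched as follows. This is a deliberately tiny model (queues reduced to length counters, no threads), much simpler than the real migration logic: at a balance point the average queue length is computed and jobs migrate toward it, and an empty scheduler steals from the longest queue.

```c
/* Toy model of run-queue balancing and work stealing. */
#define NSCHED 4

/* For the sketch, a run queue is just its length. */
static int rq_len[NSCHED];

/* Periodic balancing: migrate jobs until every queue is near
 * the average length. */
static void balance(void) {
    int total = 0;
    for (int i = 0; i < NSCHED; i++) total += rq_len[i];
    int avg = total / NSCHED;
    for (int i = 0; i < NSCHED; i++)
        for (int j = 0; j < NSCHED; j++)
            while (rq_len[i] > avg && rq_len[j] < avg) {
                rq_len[i]--;          /* migrate one job: i -> j */
                rq_len[j]++;
            }
}

/* An idle scheduler steals one job from the longest run queue.
 * Returns 1 on success, 0 if there was nothing to steal. */
static int steal(int thief) {
    int victim = -1;
    for (int i = 0; i < NSCHED; i++)
        if (i != thief && (victim < 0 || rq_len[i] > rq_len[victim]))
            victim = i;
    if (victim < 0 || rq_len[victim] == 0) return 0;
    rq_len[victim]--;
    rq_len[thief]++;
    return 1;
}
```

The real VM balances along precomputed migration paths and also deactivates schedulers when total load is low, but the shape is the same: periodic global balancing plus opportunistic stealing in between.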

(Chart: example of the performance gain with multiple run queues on a TilePro64.)

Comparing the "Clovertown" Xeon E5310 to the "Gainestown" Xeon X5570

(Charts: memory/CPU performance and interprocess communication on the two architectures.)

Insights

- No global lock ever goes unpunished
- Data, as well as execution, has to be distributed over the cores
- Malloc and friends will be a bottleneck
- You will have to make memory/speed tradeoffs
- New architectures will give you both new challenges and performance boosts; revise and rewrite as processors evolve
- Doing things (in the code) at the right time can reduce complexity as well as increase performance
- Take the time to use third-party tools, and to write your own
- Work incrementally

Tools we've used

- Lock checker (implemented in the VM) and a strict locking order
- VTune and Thread Profiler
- OProfile
- Lock counter (implemented in the VM)
- Acumem (www.acumem.com)
- Valgrind
- Benchmarks
  - Customers
  - Open source
- Percept (Erlang application parallelism measurement tool)

What now?

- Non-uniform memory access (NUMA)
  - Keep a scheduler's private memory near its core
  - Distribute processes smarter, taking memory access into account
  - …
- Delayed deallocation to avoid allocator lock conflicts
  - Especially important for Core systems
- Developing our libraries
- Using less memory?
- More measuring, benchmarking, customer tests…
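The "delayed deallocation" idea can be sketched like this (an assumed design based on the bullet above, not the shipped implementation). Instead of a remote free taking the owner's allocator lock, the block is pushed onto the owner's incoming list with a lock-free compare-and-swap, and the owning scheduler drains that list at a convenient point.

```c
/* Sketch of delayed deallocation via a lock-free incoming list. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct dblock { struct dblock *next; } dblock_t;

typedef struct {
    _Atomic(dblock_t *) incoming;  /* blocks freed by other schedulers */
} delayed_free_q_t;

/* Any scheduler: O(1), no lock taken, no allocator touched. */
static void remote_free(delayed_free_q_t *q, dblock_t *b) {
    dblock_t *head = atomic_load(&q->incoming);
    do {
        b->next = head;            /* head is reloaded on CAS failure */
    } while (!atomic_compare_exchange_weak(&q->incoming, &head, b));
}

/* Owner only: grab the whole list at once and recycle it locally. */
static dblock_t *drain(delayed_free_q_t *q) {
    return atomic_exchange(&q->incoming, NULL);
}
```

The owner pays one atomic exchange per drain regardless of how many remote frees piled up, which is exactly what makes this attractive when allocator locks become a contention point.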