Building Composable Parallel Software with Liquid Threads
Heidi Pan*, Benjamin Hindman+, Krste Asanovic+ (*MIT, +UC Berkeley)
Microsoft Numerical Library Incubation Team Visit, UC Berkeley, April 29, 2008

Today's Parallel Programs are Fragile

Parallel programs usually need to be aware of hardware resources to achieve good performance:
- Don't incur the overhead of thread creation if there are no resources to run in parallel.
- Run related tasks on the same core to preserve locality.

Today's programs don't have direct control over resources; they hope that the OS will do the right thing:
- Create 1 kernel thread per core.
- Manually multiplex work onto kernel threads to control locality & task prioritization.

Even if the OS tries to bind each thread to a particular core, it's still not enough!

[Figure: an integer programming app (branch & bound) spawns tasks into the Task Parallel Library (TPL) runtime, which multiplexes them onto kernel threads KT0-KT5, scheduled by the OS onto cores P0-P5.]

Today's Parallel Codes are Not Composable

[Figure: the integer programming app (B&B) spawns into the TPL runtime, while the math library (MKL) runs parallel-for loops on its own OpenMP runtime; both runtimes create their own kernel threads on cores P0-P5.]

The system is oversubscribed! Today's typical solution: use the sequential version of libraries within a parallel app.
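To make the problem concrete, here is a minimal, hypothetical C sketch of this composition pattern (my illustration, not code from the talk): the app creates one pthread per core, and each worker calls into a routine that is itself parallelized with OpenMP, so a P-core machine ends up with roughly P x P runnable threads.

```c
/* Hypothetical sketch of the composition problem described above. */
#include <pthread.h>
#include <omp.h>
#include <stdio.h>

#define N 1024
static double x[N], y[N];

static void *worker(void *arg) {
    (void)arg;
    /* "Library call": internally an OpenMP parallel for, which by default
     * spawns roughly one thread per core -- on top of the app's threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] += 2.0 * x[i];
    return NULL;
}

int main(void) {
    int ncores = omp_get_num_procs();
    if (ncores > 64) ncores = 64;
    pthread_t tids[64];

    for (int i = 0; i < ncores; i++)          /* one "app" thread per core */
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < ncores; i++)
        pthread_join(tids[i], NULL);

    printf("up to %d x %d threads were runnable at once\n", ncores, ncores);
    return 0;
}
```

Compile with `cc -fopenmp -pthread`; each nesting level multiplies the thread count, which is exactly the oversubscription the slide describes.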

A Global Scheduler is Not the Right Solution

- It is difficult to design a one-size-fits-all scheduler that provides enough expressiveness and performance for a wide range of codes. For example, how do you design a dynamic load-balancing scheduler that preserves locality for both divide-and-conquer and linear algebra algorithms?
- It is difficult to convince all software vendors and programmers to comply with the same programming model.
- It is difficult to optimize critical sections of code without interfering with or changing the global scheduler.

[Figure: the integer programming app (B&B) and a solver both feed parallel constructs (spawn, parallel for, ...) into a generic global scheduler, in user space or the OS.]

Cooperative Hierarchical Scheduling

Goals:
- Distributed scheduling: customizable, scalable, extensible schedulers that make localized, code-specific scheduling decisions.
- Hierarchical scheduling: a parent decides the relative priority of its children.
- Cooperative scheduling: schedulers cooperate with each other to achieve globally optimal performance for the app.

[Figure: in the integer programming app (B&B), the TPL scheduler (parent) sits above the OpenMP scheduler (child) used by the solver.]

Cooperative Hierarchical Scheduling (cont.)

- Distributed scheduling: at any point in time, each scheduler has full control over a subset of the kernel threads allotted to the application, on which it schedules its own code.
- Hierarchical scheduling: a scheduler decides how many of its kernel threads to give to each child scheduler, and when those threads are given (a grant-policy sketch follows below).
- Cooperative scheduling: a scheduler decides when to relinquish its kernel threads, instead of being pre-empted by its parent scheduler.

[Figure: a TPL scheduler at the root grants threads to several OpenMP child schedulers.]
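As a toy illustration of the hierarchical decision (my sketch, not from the talk), a parent might grant a requesting child at most half of its currently idle threads; the function name and the policy itself are entirely hypothetical.

```c
/* Hypothetical parent-side grant policy: keep half of the idle threads,
 * offer the rest to the requesting child.  Purely illustrative. */
int decide_grant(int idle_threads, int child_request) {
    int spare = idle_threads / 2;               /* parent keeps half */
    return child_request < spare ? child_request : spare;
}
```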

Standardizing the Inter-Scheduler Interface

A standardized inter-scheduler resource-management interface is needed to achieve cooperative hierarchical scheduling. We need to extend the sequential ABI to support the transfer of resources!

[Figure: the TPL scheduler (parent) and the OpenMP scheduler (child) in the integer programming app communicate through the standardized interface.]

Updating the ABI for the Parallel World

Functional ABI (identical to a sequential call):
- call transfers the thread to the callee, which has full control of register & stack resources to schedule its instructions, and cooperatively relinquishes the thread upon return (ret).

Resource management ABI:
- A parallel callee registers (reg) with its caller to ask for more resources, and unregisters (unreg) when done.
- The caller enters (enter) the callee on each additional thread that it decides to grant.
- The callee cooperatively yields (yield) threads back.

[Figure: timelines contrasting a sequential call/ret with a parallel call that adds reg, enter, yield, and unreg transitions; the TPL scheduler grants threads T0-T5 across cores P0-P5 while OpenMP computes solve(A), with idle workers stealing.]
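The following C sketch shows one plausible shape for such an interface. It is an assumption-laden illustration: the struct layout, function names, and calling conventions approximate, but are not, the actual Lithe ABI.

```c
/* Illustrative sketch of the resource-management ABI above; all names
 * are hypothetical, not the real Lithe interface. */
typedef struct sched sched_t;

struct sched {
    sched_t *parent;                    /* scheduler that called into us */
    /* Callbacks invoked on a scheduler by the runtime: */
    void (*enter)(sched_t *self);       /* a granted thread arrives here */
    void (*request)(sched_t *self,      /* a child asks for more threads */
                    sched_t *child, int nthreads);
};

/* reg: a parallel callee registers with its caller's scheduler and asks
 * for as many threads as the caller is willing to grant. */
void sched_register(sched_t *child, sched_t *parent, int nthreads) {
    child->parent = parent;
    parent->request(parent, child, nthreads);
}

/* yield: when a child runs out of work on a thread, it hands the thread
 * back by entering its parent's scheduler. */
void sched_yield_thread(sched_t *child) {
    child->parent->enter(child->parent);
}
```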

The Case for a Resource Management ABI

By making resources a first-class citizen, we enable:
- Composability: code can be written without knowing the context in which it will be called, encouraging abstraction, reuse, and independence.
- Scalability: code can call any library function without worrying about inadvertently oversubscribing the system's resources.
- Heterogeneity: an application can incorporate parallel libraries that are implemented in different languages and/or linked with different runtimes.
- Transparency: a library function looks the same to its caller, regardless of whether its implementation is sequential or parallel.

TPL Example: Managing Child Schedulers

[Figure: the TPL work-stealing queue holding spawned continuations and a stolen enter task; threads T0-T2 call into solve(A), which runs on OpenMP.]

- T0: (1) pushes continuations at spawn points onto the work queue; (2) upon child registration, pushes the child's enter to recruit more threads; (3) the child keeps track of its own parallelism (not pushed onto the parent queue).
- T1: steals a subtree to compute.
- T2: steals the enter task, which effectively grants the thread to the child.

(A sketch of step 2 follows below.)
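Here is a hedged sketch of that granting trick, reusing the hypothetical sched_t from the earlier ABI sketch: the parent wraps the child's enter callback in an ordinary stealable task, so whichever idle worker steals it donates its thread to the child. task_t and deque_push are assumed helpers, not TPL internals.

```c
/* sched_t as in the earlier ABI sketch; names remain hypothetical. */
typedef struct task { void (*run)(void *); void *arg; } task_t;

void deque_push(task_t t);   /* assumed: push onto the work-stealing deque */

static void run_child_enter(void *arg) {
    sched_t *child = (sched_t *)arg;
    child->enter(child);     /* the stealing thread now belongs to the child */
}

/* Called when a child scheduler registers and requests threads. */
void on_child_request(sched_t *child) {
    task_t t = { .run = run_child_enter, .arg = child };
    deque_push(t);           /* any worker that steals this grants itself */
}
```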

MVMult Example: Managing a Variable Number of Threads

[Figure: a parallel for loop partitioned into tasks; each granted thread repeatedly fetches the next task, with reg/unreg, enter, and yield transitions over time.]

- Partition the work into tasks, each operating on an optimal cache-block size.
- Instead of statically mapping all tasks onto a fixed number of threads (SPMD), tasks are dynamically fetched by the currently available threads (and thereby load balanced).
- There is no loss of locality if there is no reuse of data between tasks.
- Additional synchronization may be needed to impose an ordering on noncommutative floating-point operations.

(A task-fetching sketch follows below.)
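A minimal sketch of the dynamic task-fetching scheme, assuming an atomic counter hands out cache-block-sized row ranges; the block size, names, and kernel are illustrative, not the talk's implementation.

```c
/* Each granted thread runs this worker: grab the next block-sized task
 * with an atomic counter, and stop (yielding the thread back) when the
 * tasks run out.  Illustrative sketch only. */
#include <stdatomic.h>

#define BLOCK 64                   /* rows per task; assumed cache-friendly */
static atomic_int next_task;
static int nrows;

void mvmult_block(int row0, int row1);   /* assumed y[r0..r1) += A*x kernel */

void mvmult_worker(void) {         /* runs on every thread we are granted */
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);
        int row0 = t * BLOCK;
        if (row0 >= nrows) break;  /* no more tasks: yield the thread back */
        int row1 = row0 + BLOCK < nrows ? row0 + BLOCK : nrows;
        mvmult_block(row0, row1);
    }
}
```

Because idle threads pull work rather than having work pushed to them, the same code adapts whether it is granted one thread or many.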

Liquid Threads Model

[Figure: threads flow between modules over time across cores P0-P3, via call/ret and enter/yield transitions.]

Thread resources flow dynamically & flexibly between different modules, giving more robust parallel codes that adapt to different and changing environments.

Lithe: Liquid Thread Environment

ABI: call/ret (functional) plus enter/yield/request (cooperative resource management).
- Lithe is not a (high-level) programming model; it is a low-level ABI for expert programmers (compiler/tool/standard-library developers) to control resources & map parallel codes.
- Lithe can be deployed incrementally because it supports sequential library function calls & provides some basic cooperative schedulers.
- Lithe also supports management of other resources, such as memory and bandwidth.
- Lithe also supports (uncooperative) revocation of resources by the OS.

Lithe's Interaction with the OS

[Figure: time-multiplexing (apps sharing cores P0-P3 over time) vs. space-multiplexing (spatial partitioning, with each app owning a subset of cores).]

Up till now, we've implicitly assumed that we're the only app running, but the OS usually time-multiplexes multiple apps onto the machine. We believe that a manycore OS should partition the machine spatially & give each app direct control over its resources (cores instead of kernel threads). The OS may want to dynamically change the resource allocation between apps depending on the current workload:
- Lithe-compliant schedulers are robust: they can easily absorb additional threads given by the OS & yield threads voluntarily back to the OS.
- Lithe-compliant schedulers can also check for contexts from threads pre-empted by the OS and schedule them on the remaining threads.
- Lithe-compliant schedulers don't use spinlocks (deadlock avoidance).
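A minimal sketch of what absorbing and returning cores might look like, again with the hypothetical sched_t from the ABI sketch; the upcall names and the save_current_context/os_yield_core helpers are invented for illustration, not an actual OS interface.

```c
/* Hypothetical OS upcalls into the app's root scheduler. */
void save_current_context(sched_t *root);  /* assumed: re-queue current work */
void os_yield_core(void);                  /* assumed: return core to the OS */

void on_core_granted(sched_t *root) {
    root->enter(root);          /* absorb: start scheduling on the new core */
}

void on_core_revoke_request(sched_t *root) {
    save_current_context(root); /* park the pre-emptible work... */
    os_yield_core();            /* ...then give the core back voluntarily */
}
```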

Status: In an Early Stage of Development

[Screenshot: Fibonacci running on Vthread (a work-stealing scheduler) inside Slither, with threads added/killed at the prompt.]

Slither simulates a variable-sized partition:
- We simulate hard threads using pthreads.
- We simulate partitions using processes.
The user can dynamically add/kill threads in the Vthread partition through the Slither prompt, and Vthread will adapt.

Summary

Lithe defines a new parallel ABI that:
- supports cooperative hierarchical scheduling;
- enables a liquid threads model, in which thread resources flow dynamically & flexibly between different modules;
- provides the foundation for building composable & robust parallel software.

The work is funded partly by [sponsor logos].