Slide 1: Building Composable Parallel Software with Liquid Threads
Heidi Pan* , Benjamin Hindman+, Krste Asanovic+ (*MIT, +UC Berkeley)
Microsoft Numerical Library Incubation Team Visit, UC Berkeley, April 29, 2008
Slide 2: Today's Parallel Programs are Fragile
Parallel programs usually need to be aware of hardware resources to achieve good performance:
- Don't incur the overhead of thread creation if there are no resources to run in parallel.
- Run related tasks on the same core to preserve locality.
Today's programs don't have direct control over resources; they hope the OS will do the right thing:
- Create one kernel thread per core.
- Manually multiplex work onto the kernel threads to control locality and task prioritization.
Even if the OS tries to bind each thread to a particular core, that is still not enough.
[Figure: an integer programming app (branch and bound) spawns tasks through the Task Parallel Library (TPL) runtime onto kernel threads KT0-KT5, which the OS maps onto cores P0-P5.]
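The "one kernel thread per core, multiplex work manually" pattern the slide describes can be sketched as follows. This is a minimal illustration, not the TPL implementation; the squaring task is a stand-in for real work.

```python
import os
import queue
import threading

# One worker thread per core; many small tasks multiplexed onto them,
# rather than one (expensive) thread per task.
num_cores = os.cpu_count() or 4
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

for n in range(100):
    tasks.put(n)

def worker():
    while True:
        try:
            n = tasks.get_nowait()
        except queue.Empty:
            return                  # no more work: let the thread exit
        with results_lock:
            results.append(n * n)   # stand-in for real work

workers = [threading.Thread(target=worker) for _ in range(num_cores)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))   # -> 100
```

Note that this hand-rolled multiplexing is exactly what becomes fragile under composition: a library called from inside `worker` has no way to know these threads already saturate the machine.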
Slide 3: Today's Parallel Codes are Not Composable
[Figure: the integer programming app (B&B) spawns tasks through the TPL runtime, and one task calls a parallel-for matrix routine in a math library (MKL) backed by the OpenMP runtime; both runtimes create their own kernel threads on cores P0-P5.]
The system is oversubscribed! Today's typical solution: use the sequential version of libraries within a parallel app.
Slide 4: A Global Scheduler is Not the Right Solution
- It is difficult to design a one-size-fits-all scheduler that provides enough expressiveness and performance for a wide range of codes. How do you design a dynamic load-balancing scheduler that preserves locality for both divide-and-conquer and linear algebra algorithms?
- It is difficult to convince all software vendors and programmers to comply with the same programming model.
- It is difficult to optimize critical sections of code without interfering with or changing the global scheduler.
[Figure: the app and a solver library both feed parallel constructs (spawn, parallel for, …) into a generic global scheduler in user space or the OS.]
Slide 5: Cooperative Hierarchical Scheduling
Goals:
- Distributed scheduling: customizable, scalable, extensible schedulers that make localized, code-specific scheduling decisions.
- Hierarchical scheduling: a parent decides the relative priority of its children.
- Cooperative scheduling: schedulers cooperate with each other to achieve globally optimal performance for the app.
[Figure: the TPL scheduler (parent) of the integer programming app sits above the OpenMP scheduler (child) of the solver library.]
Slide 6: Cooperative Hierarchical Scheduling
- Distributed scheduling: at any point in time, each scheduler has full control over a subset of the kernel threads allotted to the application, and uses them to schedule its own code.
- Hierarchical scheduling: a scheduler decides how many of its kernel threads to give to each child scheduler, and when those threads are given.
- Cooperative scheduling: a scheduler decides when to relinquish its kernel threads instead of being preempted by its parent scheduler.
[Figure: a TPL scheduler partitions its kernel threads among several child OpenMP schedulers, each working on its own matrix blocks.]
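The hierarchical decision — how many threads to give each child — can be sketched as a simple proportional policy. This is a toy illustration of one possible grant policy; the function name and the priority-proportional rule are assumptions, not part of the system described in the slides.

```python
# A parent scheduler splits its kernel threads among child schedulers
# in proportion to the relative priorities it assigned them.
def grant_threads(total_threads, child_priorities):
    total_priority = sum(child_priorities.values())
    grants = {name: (total_threads * p) // total_priority
              for name, p in child_priorities.items()}
    # Hand leftover threads (from integer division) to the
    # highest-priority children first.
    leftover = total_threads - sum(grants.values())
    for name in sorted(child_priorities, key=child_priorities.get, reverse=True):
        if leftover == 0:
            break
        grants[name] += 1
        leftover -= 1
    return grants

print(grant_threads(8, {"openmp_a": 2, "openmp_b": 1, "openmp_c": 1}))
# -> {'openmp_a': 4, 'openmp_b': 2, 'openmp_c': 2}
```

The point of making this a per-parent policy, rather than a global one, is that each scheduler can pick whatever rule suits its own code.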
Slide 7: Standardizing the Inter-Scheduler Interface
A standardized inter-scheduler resource management interface is needed to achieve cooperative hierarchical scheduling. We need to extend the sequential ABI to support the transfer of resources.
[Figure: the TPL scheduler (parent) and the solver's OpenMP scheduler (child) communicate through the standardized interface.]
Slide 8: Updating the ABI for the Parallel World
- Functional ABI: a call transfers the thread to the callee, which has full control of register and stack resources to schedule its instructions, and cooperatively relinquishes the thread upon return. This is identical to a sequential call.
- Resource management ABI: a parallel callee registers with its caller to ask for more resources. The caller enters the callee on the additional threads it decides to grant. The callee cooperatively yields those threads back.
[Figure: timelines contrasting a sequential call/ret with the parallel sequence — call, reg, enter, yield, unreg, ret — across threads T0-T5 running solve(A) on OpenMP, with TPL stealing in parallel.]
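The register/enter/yield handshake can be illustrated with a toy pair of schedulers. This is a minimal sketch with hypothetical class and method names; in the real ABI the transfer happens at the level of kernel threads and calling conventions, not Python objects.

```python
import threading

class ChildScheduler:
    """Toy parallel callee: receives threads from its caller via enter()
    and cooperatively yields each one back by returning from enter()."""
    def __init__(self):
        self.work_done = []
        self.lock = threading.Lock()

    def enter(self, thread_id):
        # The caller grants us a thread by entering us on it; we schedule
        # our own work here, then return, which yields the thread back.
        with self.lock:
            self.work_done.append(thread_id)

class CallerScheduler:
    """Toy caller: decides how many threads to grant a registered child."""
    def __init__(self, num_threads):
        self.num_threads = num_threads

    def call(self, child):
        # The child has registered for resources; the caller enters it
        # on each additional thread it decides to grant.
        threads = [threading.Thread(target=child.enter, args=(i,))
                   for i in range(self.num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()   # all threads have been yielded back

parent = CallerScheduler(num_threads=4)
child = ChildScheduler()
parent.call(child)
print(sorted(child.work_done))   # -> [0, 1, 2, 3]
```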
Slide 9: The Case for a Resource Management ABI
By making resources a first-class citizen, we enable:
- Composability: code can be written without knowing the context in which it will be called, encouraging abstraction, reuse, and independence.
- Scalability: code can call any library function without worrying about inadvertently oversubscribing the system's resources.
- Heterogeneity: an application can incorporate parallel libraries that are implemented in different languages and/or linked with different runtimes.
- Transparency: a library function looks the same to its caller, regardless of whether its implementation is sequential or parallel.
Slide 10: TPL Example: Managing Child Schedulers
[Figure: threads T0-T2 work on the TPL queue while solve(A) runs OpenMP work.]
- T0: (1) pushes continuations at spawn points onto the work queue; (2) upon child registration, pushes the child's enter task to recruit more threads; (3) the child keeps track of its own parallelism, which is not pushed onto the parent's queue.
- T1: steals a subtree to compute.
- T2: steals the enter task, which effectively grants that thread to the child.
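The trick of recruiting threads by pushing an "enter" task onto the parent's work queue can be sketched as follows. This is an illustrative toy, run on one thread for determinism; in a real work-stealing scheduler, idle threads would pop these tasks concurrently from the other end of a deque.

```python
from collections import deque

# Hypothetical names throughout: the parent's work queue holds ordinary
# continuation tasks plus an "enter" task pushed when a child registers.
work_queue = deque()
log = []

def continuation(n):
    log.append(f"continuation-{n}")

def child_enter():
    # Whichever thread steals and runs this task is, in effect,
    # granted to the child scheduler.
    log.append("thread granted to child scheduler")

work_queue.append(lambda: continuation(1))   # pushed at a spawn point
work_queue.append(child_enter)               # pushed on child registration
work_queue.append(lambda: continuation(2))

while work_queue:
    task = work_queue.popleft()   # a real stealer would pop the other end
    task()

print(log)
```

The elegance of this scheme is that recruiting threads for a child needs no new mechanism: granting a thread is just another stealable task.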
Slide 11: MVMult Example: Managing a Variable Number of Threads
[Figure: a parallel-for matrix-vector multiply; threads fetch the next task between enter and yield.]
- Partition the work into tasks, each operating on an optimal cache block size.
- Instead of statically mapping all tasks onto a fixed number of threads (SPMD), tasks are dynamically fetched by the currently available threads, which also balances the load.
- There is no loss of locality if no data is reused between tasks.
- Additional synchronization may be needed to impose an ordering on noncommutative floating-point operations.
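Dynamic task fetching can be sketched with a shared next-task counter that each thread advances atomically. This is a minimal illustration of the pattern; the block computation is a stand-in, and the names are not from the actual MKL/OpenMP code.

```python
import threading

# Each thread atomically grabs the index of the next unprocessed block,
# so however many threads happen to be available, every block is
# processed exactly once and the load balances automatically.
NUM_BLOCKS = 16
next_block = 0
counter_lock = threading.Lock()
processed = []
processed_lock = threading.Lock()

def worker():
    global next_block
    while True:
        with counter_lock:           # atomic fetch-and-increment
            block = next_block
            if block >= NUM_BLOCKS:
                return
            next_block += 1
        with processed_lock:
            processed.append(block)  # stand-in for one cache-block multiply

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(processed))   # every block processed exactly once
```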
Slide 12: Liquid Threads Model
[Figure: thread resources flow between modules across cores P0-P3 through call, ret, enter, and yield.]
Thread resources flow dynamically and flexibly between different modules, yielding more robust parallel codes that adapt to different and changing environments.
Slide 13: Lithe: Liquid Thread Environment
The ABI consists of call and ret (functional) plus enter, yield, and request (cooperative resource management).
- Lithe is not a (high-level) programming model; it is a low-level ABI for expert programmers (compiler, tool, and standard-library developers) to control resources and map parallel codes.
- Lithe can be deployed incrementally because it supports sequential library function calls and provides some basic cooperative schedulers.
- Lithe also supports management of other resources, such as memory and bandwidth.
- Lithe also supports (uncooperative) revocation of resources by the OS.
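The shape of such a scheduler-facing interface can be sketched as a callback surface. The method names below are illustrative assumptions, not the actual Lithe ABI (which is expressed at the calling-convention level, not as Python classes).

```python
from abc import ABC, abstractmethod

class LitheStyleScheduler(ABC):
    """A sketch of the callbacks a Lithe-style scheduler might expose.
    Names are hypothetical; only the roles mirror the slide's ABI."""

    @abstractmethod
    def enter(self):
        """Run on a thread granted to this scheduler; the scheduler
        dispatches its own tasks here until it decides to yield."""

    @abstractmethod
    def request(self, child, num_threads):
        """A child scheduler asks this scheduler for more threads."""

    @abstractmethod
    def yield_thread(self):
        """Cooperatively hand the current thread back to the parent."""

class ToyScheduler(LitheStyleScheduler):
    def __init__(self):
        self.events = []
    def enter(self):
        self.events.append("enter")
        self.yield_thread()          # done with our work: give it back
    def request(self, child, num_threads):
        self.events.append(f"request({num_threads})")
    def yield_thread(self):
        self.events.append("yield")

s = ToyScheduler()
s.request(None, 2)
s.enter()
print(s.events)   # -> ['request(2)', 'enter', 'yield']
```

Because any runtime (TPL, OpenMP, Cilk-style, ...) can implement these callbacks its own way, schedulers compose without agreeing on a programming model.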
Slide 14: Lithe's Interaction with the OS
- Until now we have implicitly assumed we are the only app running, but the OS usually time-multiplexes multiple apps onto the machine.
- We believe a manycore OS should partition the machine spatially and give each app direct control over its resources (cores instead of kernel threads).
- The OS may want to change the resource allocation between apps dynamically, depending on the current workload.
- Lithe-compliant schedulers are robust: they can easily absorb additional threads given by the OS and yield threads voluntarily back to the OS.
- Lithe-compliant schedulers can also dynamically check for contexts from threads preempted by the OS and schedule them on the remaining threads.
- Lithe-compliant schedulers don't use spinlocks (deadlock avoidance).
[Figure: time-multiplexing of apps on cores P0-P3 versus space-multiplexing (spatial partitioning) across apps 1-3.]
Slide 15: Status: In Early Stage of Development
- Slither simulates a variable-sized partition: hard threads are simulated with pthreads, and partitions are simulated with processes.
- The demo runs Fibonacci on Vthread, a work-stealing scheduler. The user can dynamically add or kill threads in the Vthread partition through the Slither prompt, and Vthread will adapt.
Slide 16: Summary
Lithe defines a new parallel ABI that:
- supports cooperative hierarchical scheduling;
- enables a liquid threads model in which thread resources flow dynamically and flexibly between different modules;
- provides the foundation to build composable and robust parallel software.
The work is funded partly by