Published by Gavyn Reames. Modified over 9 years ago.
1
Extending Task-based Programming Model beyond Shared-memory Systems
Minjae Hwang, Thawan Kooburat
CS758 Class Project, Fall 2009
2
Outline
- Introduction
- Related Work
- Design
- Implementation
- Evaluation
3
Introduction
- Parallel programming in shared-memory systems
  - OpenMP
  - Pthreads
  - Cilk/TBB
- What if we need multiple machines to solve a problem?
  - Cluster of shared-memory systems
4
Examples
- Jaguar (ORNL): the fastest supercomputer as of late 2009
5
Related Work
- Message Passing Interface (MPI)
- Distributed Shared Memory (DSM)
- Distributed Cilk
- Intel OpenMP Cluster
- OpenMP/MPI
6
Task-based Programming
- Intel Threading Building Blocks (TBB)
- Cilk
- Java Fork/Join Framework (JSR166)
- Characteristics
  - Fork/Join parallelism
  - A task is a small, non-blocking portion of code
  - Allows programmers to easily express fine-grained parallelism
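To make the fork/join style concrete, here is a minimal sketch using the standard `java.util.concurrent` Fork/Join API (Java 8 or later). The class name `SumTask` and the `THRESHOLD` value are illustrative assumptions, not part of the original slides.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a slice of an array by recursively splitting it into two subtasks.
class SumTask extends RecursiveTask<Long> {
    // Illustrative cutoff below which the work is done sequentially (assumption).
    private static final int THRESHOLD = 1_000;

    private final long[] data;
    private final int lo; // inclusive
    private final int hi; // exclusive

    SumTask(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                           // schedule the left half asynchronously
        return right.compute() + left.join();  // do the right half inline, then wait for the left
    }

    // Convenience entry point that runs the task on the common pool.
    static long parallelSum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }
}
```

Each task is small and never blocks on anything except its own children, which is what lets a work-stealing scheduler keep every thread busy.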
7
Programming Model Considerations
- A cluster has two levels with different characteristics
  - Within a single machine: implicit communication via shared memory
  - Across the cluster: explicit communication via the network
    - Network latency and bandwidth limitations
- The programming model should capture the hierarchical nature of the system
8
Programming Model
- TaskGroup/Task programming model
- The programmer divides the computation into TaskGroups
  - Use a divide-and-conquer pattern to generate TaskGroups
  - All input must be included in the TaskGroup itself
- A TaskGroup executes by spawning Tasks
  - Tasks always run on the same machine as their parent TaskGroup
  - Tasks communicate via shared memory
9
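Fibonacci Example

Local version:

    public class FibTG extends TaskGroup {
        int size;

        // Constructor assumed; the slide omits it but calls new FibTG(...)
        FibTG(int size) { this.size = size; }

        protected Long compute() {
            if (size == 2 || size == 1) return 1L;
            FibTG first = new FibTG(size - 1);
            FibTG second = new FibTG(size - 2);
            first.fork();                           // run first asynchronously
            return second.invoke() + first.join();  // run second inline, then wait for first
        }
    }

Version with a remote cutoff:

    public class FibTG extends TaskGroup {
        int size;

        // Constructor assumed; the slide omits it but calls new FibTG(...)
        FibTG(int size) { this.size = size; }

        protected Long compute() {
            if (size == 2 || size == 1) return 1L;
            FibTG first = new FibTG(size - 2);
            FibTG second = new FibTG(size - 1);
            if (size >= 35) {
                // At or above the cutoff, the TaskGroup may run on another machine
                first.remoteFork();
                return second.invoke() + first.remoteJoin();
            } else {
                // Otherwise, run locally
                first.fork();
                return second.invoke() + first.join();
            }
        }
    }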
10
Overview
(Diagram: a Scheduler with a Global Queue and two shared-memory Workers; on each Worker a TaskGroup forks Tasks; arrows labeled RemoteFork, PushTask, and ResultReturn connect the Workers and the Scheduler)
11
Matrix Multiplication Example
- A TaskGroup divides the matrices and copies them into smaller ones
(Diagram: matrices split into blocks a1..a4, b1..b4, and c1..c4; the TaskGroup for c1 holds a1, b1, a2, and b3)
12
Matrix Multiplication Example
- Tasks compute and store their results in the TaskGroup’s matrix
(Diagram: matrices split into blocks a1..a4, b1..b4, and c1..c4; Tasks inside the TaskGroup for c1 work on a1, b1, a2, and b3)
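The block decomposition in the two matrix slides can be sketched as follows. This is a single-process illustration, assuming square matrices of even size split into quadrants laid out as a1 (top-left), a2 (top-right), a3 (bottom-left), a4 (bottom-right), so that c1 = a1·b1 + a2·b3. The class and method names are hypothetical; in the slides' model, each output block would be carried by its own TaskGroup together with copies of the input blocks it needs.

```java
// Sketch only: computes C = A * B quadrant by quadrant, copying input blocks first.
class BlockMatMul {
    // Copies the h-by-h block whose top-left corner is (row, col).
    static double[][] block(double[][] m, int row, int col, int h) {
        double[][] out = new double[h][h];
        for (int i = 0; i < h; i++) {
            System.arraycopy(m[row + i], col, out[i], 0, h);
        }
        return out;
    }

    // Plain triple-loop product of two square matrices of equal size.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                for (int j = 0; j < n; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    // Adds src into dst, placing src's top-left corner at (row, col).
    static void addInto(double[][] dst, int row, int col, double[][] src) {
        for (int i = 0; i < src.length; i++) {
            for (int j = 0; j < src[i].length; j++) {
                dst[row + i][col + j] += src[i][j];
            }
        }
    }

    // Each (bi, bj) pair is one output quadrant, e.g. (0, 0) is c1 = a1*b1 + a2*b3.
    static double[][] blockMultiply(double[][] a, double[][] b) {
        int n = a.length;
        if (n % 2 != 0) {
            throw new IllegalArgumentException("matrix size must be even");
        }
        int h = n / 2;
        double[][] c = new double[n][n];
        for (int bi = 0; bi < 2; bi++) {
            for (int bj = 0; bj < 2; bj++) {
                for (int k = 0; k < 2; k++) {
                    double[][] aBlk = block(a, bi * h, k * h, h);
                    double[][] bBlk = block(b, k * h, bj * h, h);
                    addInto(c, bi * h, bj * h, multiply(aBlk, bBlk));
                }
            }
        }
        return c;
    }
}
```

Copying the blocks costs extra memory, but it makes each unit of work self-contained, which matters once a unit may be shipped to another machine.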
13
Scheduling Considerations
- Work-stealing is common practice in task-based schedulers
- What modifications does an existing local work-stealing scheduler require?
  - Work-stealing is not instant
  - Work-stealing requires serialization
  - Returning a result requires extra work
- Bottom line: steal-on-demand is not efficient
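The serialization cost mentioned above can be illustrated with a small round trip through a byte array. This sketch assumes standard Java object serialization; the slides do not say which mechanism the project actually used, and `FibWork` and `StealCodec` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// A unit of work that carries all of its input, so it can be shipped to another machine.
class FibWork implements Serializable {
    private static final long serialVersionUID = 1L;
    final int size;

    FibWork(int size) {
        this.size = size;
    }

    // Iterative Fibonacci with fib(1) = fib(2) = 1, matching the slide's base case.
    long run() {
        long a = 1, b = 1;
        for (int i = 3; i <= size; i++) {
            long c = a + b;
            a = b;
            b = c;
        }
        return b;
    }
}

// Converts work items to and from bytes, as a remote steal would have to.
class StealCodec {
    static byte[] toBytes(Serializable work) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(work);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A local steal only moves a reference between queues, while a remote steal pays for this encoding, the network transfer, the decoding, and a second trip to return the result.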
14
Scheduling Designs
- Hierarchical queue
- Work-stealing policy
15
Hierarchical Work-stealing Queue
- TaskGroup scheduling is based entirely on queue management
- Hierarchical work-stealing queue
  - Three levels of queues
    - Global Queue
    - Local Queue
    - Thread Queue
- Pre-fetching
  - Hides network latency
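One way to picture the three levels is a lookup that falls through from the nearest queue to the farthest. The sketch below is a single-threaded simplification under stated assumptions: the class name and the lookup order are illustrative, there is no synchronization, and in a real cluster the global queue would sit on the scheduler node and be reached over the network.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: three queue levels consulted from cheapest to most expensive.
class HierarchicalQueue<T> {
    final Deque<T> threadQueue = new ArrayDeque<>(); // owned by one worker thread
    final Deque<T> localQueue = new ArrayDeque<>();  // shared by the threads on one machine
    final Deque<T> globalQueue = new ArrayDeque<>(); // stands in for the scheduler's queue

    // Returns the next item to run, or null when all three levels are empty.
    T next() {
        T item = threadQueue.pollLast(); // newest first, as the owner of a work-stealing deque does
        if (item != null) {
            return item;
        }
        item = localQueue.pollFirst();   // same machine, no network involved
        if (item != null) {
            return item;
        }
        return globalQueue.pollFirst();  // would be a network request in a real cluster
    }
}
```

Pre-fetching then amounts to refilling `localQueue` before it runs dry, so that `next()` rarely has to reach the last step.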
16
Hierarchical Work-stealing Queue
(Diagram: a Scheduler holding the Global Queue; a Worker holding a Local Queue and a Thread Pool)
17
Work-stealing Policies
- Static distribution
  - Distribute TaskGroups round-robin as soon as the global queue has something in it
  - Pro/Con
    - Best when all TaskGroups are the same size
    - Fails when there is load imbalance
- Pure work-stealing
  - Each machine tries to steal when its queue is empty
  - Pro/Con
    - Best when TaskGroup execution time is much larger than the network round-trip time
    - Every steal exposes network latency
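The weakness of static distribution is easy to see with a small model. The sketch below deals work items to workers round-robin and reports the finish time of the slowest worker. The names `RoundRobin`, `distribute`, and `finishTime` are illustrative, and costs are abstract time units rather than measurements.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: static round-robin assignment and the resulting finish time.
class RoundRobin {
    // Item i goes to worker i % workers.
    static <T> List<List<T>> distribute(List<T> globalQueue, int workers) {
        List<List<T>> perWorker = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            perWorker.add(new ArrayList<>());
        }
        for (int i = 0; i < globalQueue.size(); i++) {
            perWorker.get(i % workers).add(globalQueue.get(i));
        }
        return perWorker;
    }

    // Time at which the most heavily loaded worker finishes its share.
    static long finishTime(List<Long> costs, int workers) {
        long worst = 0;
        for (List<Long> share : distribute(costs, workers)) {
            long total = 0;
            for (long cost : share) {
                total += cost;
            }
            worst = Math.max(worst, total);
        }
        return worst;
    }
}
```

With costs 5, 5, 5, 5 on two workers, both finish at time 10. With costs 10, 1, 10, 1, one worker gets both expensive items and finishes at 20, although a balanced split would finish at 11.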
18
Work-stealing Policies
- Pre-fetching work-stealing
  - On-demand-steal mode
    - Worker: when its local queue is empty, it hints to the global scheduler that it is idle
    - Scheduler: when some worker is idle, it tries to steal as much as possible from non-idle workers
  - Pre-fetching mode
    - Worker: when its local queue falls below the LOW threshold, it requests TaskGroups from the global scheduler
    - Worker: when its local queue rises above the HIGH threshold, it sends away its surplus TaskGroups
  - Best when the nature of the problem is dynamic
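The worker-side rule in pre-fetching mode can be sketched as a small decision function. The class, the enum, and the definition of surplus as "everything above HIGH" are assumptions made for illustration; the slides only name the LOW and HIGH thresholds.

```java
// Sketch only: worker-side decision in pre-fetching mode.
class PrefetchPolicy {
    enum Action { REQUEST_TASKGROUPS, PUSH_SURPLUS, NONE }

    private final int low;
    private final int high;

    PrefetchPolicy(int low, int high) {
        if (low < 0 || low >= high) {
            throw new IllegalArgumentException("require 0 <= low < high");
        }
        this.low = low;
        this.high = high;
    }

    // Decides what the worker should do for the given local queue length.
    Action decide(int localQueueSize) {
        if (localQueueSize < low) {
            return Action.REQUEST_TASKGROUPS; // ask the global scheduler before running dry
        }
        if (localQueueSize > high) {
            return Action.PUSH_SURPLUS;       // hand back work others could run
        }
        return Action.NONE;
    }

    // Number of TaskGroups to give away (assumed here: everything above HIGH).
    int surplus(int localQueueSize) {
        return Math.max(0, localQueueSize - high);
    }
}
```

For example, with LOW = 2 and HIGH = 8, a queue of length 1 triggers a request, lengths 2 through 8 trigger nothing, and a queue of length 10 pushes 2 TaskGroups away.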
19
On-demand-steal Mode
(Diagram: a Scheduler with the Global Queue and two Workers, each with a Local Queue and a Thread Pool; one Worker's queue is marked Empty and the Worker is marked Idling)
20
Pre-fetching Mode
(Diagram: a Scheduler with the Global Queue and two Workers, each with a Local Queue and a Thread Pool; arrows labeled Pre-fetch and PushTasks)
21
Implementation Details
- Components
  - Java Fork/Join Framework
    - Similar to Threading Building Blocks
    - Manages per-thread queues
  - MPJ Express (MPI implementation for Java)
    - Establishes point-to-point communication
    - Launches the Java application on an N-node cluster
- We implemented
  - Global Scheduler / Local Queue Manager
  - Various optimization techniques
  - Work-stealing policies
22
Evaluation
- Test environment
  - Mumble cluster (mumble-01 to mumble-40)
  - Intel Q9400, quad-core 2.66 GHz, shared 6 MB L2 cache, 8 GB RAM
- Benchmark programs
  - Matrix: large matrix multiplication (4k x 4k)
  - Word Count: producer/consumer-style implementation
  - N-Queens: classic search problem with recursive task generation and load imbalance
  - Fibonacci: micro-benchmark for evaluating pure overhead
23
Results
- Scalability and speedup
- Relationship between the number of TaskGroups generated and execution time
24
Scalability and Speedup
25
TaskGroup and Execution Time
26
Post-mortem
- What went well
  - Choosing Java, Fork/Join, and MPJ Express made our lives much easier
- What went wrong
  - Execution time and speedup numbers alone do not explain the system's behavior
- How we addressed it
  - Gathered every available statistic and traced program execution
  - However, this did not give direct understanding
  - We tuned various aspects of the system and ran various benchmarks to understand it
27
Summary
- We propose a TaskGroup-based programming model
  - Ease of programming
  - Allows dynamic task generation across a cluster
  - Scales up to 16 nodes and beyond
28
Questions