Presentation transcript: "Parallel processing is not easy"

1 Parallel processing is not easy
Good schedules for parallel implementations of composite tasks are hard to find. A parallel implementation requires mapping and scheduling, optimized for makespan or throughput.
[Figure: pipeline from the user's idea of a parallel computation (e.g. STAP) to an executable file for a given parallel computer]

2 Parallel processing can be made easy with ALPS software framework
/* This is a .cpp file of the partially adaptive STAP methods that the user had in mind, optimized for minimum latency or maximum throughput on a given parallel machine. */
#include <alps.h>
...
The framework turns the user's idea of a parallel computation (e.g. STAP) into such a file in three steps:
1. Task Graph: various STAP heuristics constructed from ALPS building blocks (the ALPS Library)
2. ALPS Benchmarker & Modeler: models the execution times of the basic building blocks under various configurations
3. ALPS Scheduler: finds the configuration that minimizes overall makespan or maximizes throughput
ALPS CodeGen then emits the executable source file.
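For concreteness, here is a minimal sketch of the kind of source file such a code generator could emit. The structure, the stage-to-rank mapping, and the use of MPI are all illustrative assumptions, not ALPS CodeGen's actual output.

// Hypothetical sketch of generated code: each task from the graph becomes a
// stage whose processor assignment was fixed by the scheduler (made-up mapping).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Assumed scheduler decision: the first half of the ranks run the
    // read/fft1d stage, the second half run the STAP/join/write stage.
    if (rank < size / 2)
        std::printf("rank %d: read + fft1d stage\n", rank);
    else
        std::printf("rank %d: STAP + join + write stage\n", rank);
    MPI_Finalize();
    return 0;
}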

3 Task Graph: PRI-overlapped STAP
Step 1. The user describes the algorithm using task graphs. The ALPS software framework determines the implementation parameters, such as the number of processors for each task, and returns the executable code.
[Figure: PRI-overlapped STAP task graph with nodes read, split (with overlap), fft1d, STAP, join, and write, operating on the data cube, target cube, and steering vector]
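As an illustration, the nodes named on this slide can be wired into a simple adjacency list, as below. The exact wiring is a plausible guess from the slide, and the representation is a generic one, not ALPS's actual text front-end format.

// task_graph.cpp -- hypothetical sketch of the PRI-overlapped STAP task graph
// as a plain adjacency list (C++17; not ALPS's actual front-end format).
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // Assumed data flow: read -> split (with overlap) -> fft1d -> STAP -> join -> write.
    std::map<std::string, std::vector<std::string>> graph = {
        {"read",  {"split"}},
        {"split", {"fft1d"}},
        {"fft1d", {"STAP"}},
        {"STAP",  {"join"}},
        {"join",  {"write"}},
        {"write", {}},
    };
    for (const auto& [task, succs] : graph) {
        std::printf("%s ->", task.c_str());
        for (const auto& s : succs) std::printf(" %s", s.c_str());
        std::printf("\n");
    }
    return 0;
}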

4 ALPS Benchmarker and Modeler
Step 2. The Benchmarker measures actual timings y_i for a set of input parameters. Candidate terms {g_k}, such as the size of the data cube or the size of the processor cube, are generated from the input parameters. The Modeler then finds model coefficients x_k by least squares:

T_task ≈ Σ_k x_k g_k,  solved as  min_x ||W(Ax - b)||^2

where the columns of A hold the candidate terms g_1, ..., g_n evaluated at the m measurement points, b holds the measured timings y_1, ..., y_m, and x = (x_1, ..., x_n) holds the coefficients. The problem is solved incrementally, adding the best column to the model one at a time.
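A minimal sketch of that incremental fit, under simplifying assumptions: ordinary least squares (W = I), made-up toy data, and a plain normal-equation solver. The greedy loop adds whichever unused candidate column most reduces the residual, as described above.

// greedy_fit.cpp -- hypothetical sketch of incremental model building.
#include <cmath>
#include <cstdio>
#include <vector>
using std::vector;

// Solve the normal equations (A^T A) x = A^T b by Gaussian elimination
// (no pivoting; adequate for this toy example).
static vector<double> lstsq(const vector<vector<double>>& cols,
                            const vector<double>& b) {
    size_t n = cols.size(), m = b.size();
    vector<vector<double>> M(n, vector<double>(n + 1, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j)
            for (size_t r = 0; r < m; ++r) M[i][j] += cols[i][r] * cols[j][r];
        for (size_t r = 0; r < m; ++r) M[i][n] += cols[i][r] * b[r];
    }
    for (size_t p = 0; p < n; ++p)
        for (size_t i = p + 1; i < n; ++i) {
            double f = M[i][p] / M[p][p];
            for (size_t j = p; j <= n; ++j) M[i][j] -= f * M[p][j];
        }
    vector<double> x(n);
    for (size_t i = n; i-- > 0; ) {
        x[i] = M[i][n];
        for (size_t j = i + 1; j < n; ++j) x[i] -= M[i][j] * x[j];
        x[i] /= M[i][i];
    }
    return x;
}

// L2 norm of the residual b - Ax for the currently selected columns.
static double residual(const vector<vector<double>>& cols,
                       const vector<double>& x, const vector<double>& b) {
    double s = 0.0;
    for (size_t r = 0; r < b.size(); ++r) {
        double pred = 0.0;
        for (size_t k = 0; k < cols.size(); ++k) pred += x[k] * cols[k][r];
        s += (b[r] - pred) * (b[r] - pred);
    }
    return std::sqrt(s);
}

int main() {
    // Toy data: measured timings y_i and candidate terms g_k evaluated at
    // each measurement point (values are made up for illustration).
    vector<double> y = {1.1, 2.1, 2.9, 4.2, 5.0};
    vector<vector<double>> cand = {
        {1, 1, 1, 1, 1},    // constant term
        {1, 2, 3, 4, 5},    // e.g. data-cube size
        {1, 4, 9, 16, 25},  // e.g. data-cube size squared
    };
    vector<vector<double>> model;            // columns selected so far
    vector<bool> used(cand.size(), false);
    for (size_t step = 0; step < cand.size(); ++step) {
        int best = -1; double bestRes = 1e300;
        for (size_t k = 0; k < cand.size(); ++k) {
            if (used[k]) continue;
            model.push_back(cand[k]);        // try adding column k
            double r = residual(model, lstsq(model, y), y);
            model.pop_back();
            if (r < bestRes) { bestRes = r; best = (int)k; }
        }
        used[best] = true;
        model.push_back(cand[best]);
        std::printf("step %zu: added column %d, residual %.4f\n",
                    step, best, bestRes);
    }
    return 0;
}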

5 Ordinary least-squares
For min_x ||W(Ax - b)||^2:
- Ordinary least-squares: W = I
- COV-weighted least-squares (BLUE estimator): W = R^{-1/2}, where R is the covariance matrix; one of the best, but requires many replications, hence not practical
- VAR-weighted least-squares: W = D^{-1/2}, where D = diag(R)
- INV-weighted least-squares: W = diag(1/b)
We used ordinary, VAR-weighted, and INV-weighted least-squares with 5 replications for each measurement.
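The three practical weightings reduce to one scalar weight per measurement. A small sketch of how they could be computed from the 5 replications (timing numbers are made up; the mean of the replications stands in for the entry of b):

// weights.cpp -- hypothetical sketch of per-measurement weights for W = diag(w).
#include <cmath>
#include <cstdio>
#include <vector>
using std::vector;

int main() {
    // Five replications of each timing measurement (made-up numbers).
    vector<vector<double>> reps = {
        {1.02, 0.98, 1.01, 1.00, 0.99},
        {4.10, 3.90, 4.30, 3.80, 4.05},
    };
    for (const auto& r : reps) {
        double mean = 0.0;
        for (double v : r) mean += v;
        mean /= r.size();
        double var = 0.0;
        for (double v : r) var += (v - mean) * (v - mean);
        var /= (r.size() - 1);               // sample variance
        double wOrdinary = 1.0;              // W = I
        double wVar = 1.0 / std::sqrt(var);  // W = D^{-1/2}, D = diag(R)
        double wInv = 1.0 / mean;            // W = diag(1/b)
        std::printf("b=%.3f  ordinary=%.2f  VAR=%.2f  INV=%.2f\n",
                    mean, wOrdinary, wVar, wInv);
    }
    return 0;
}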

6 [Figure: modeling error from ordinary least-squares vs. error from weighted least-squares, with 10%/20%/30%/40%-better reference lines. Ordinary least-squares gave 26.6% error; INV-weighted least-squares gave 7.5% error]

7 ALPS Scheduler
Step 3. The Scheduler schedules tree-shaped task graphs: (tree-shaped graphs) = (linear chains) + (joining nodes). Linear chains are rather simple; joining nodes are complicated.
[Figure: processor-time diagrams of three ways to schedule tasks B and C feeding a task A: data-parallelism only [3], task-parallelism preferred [4], and other, possibly better, schedules]

8 We consider both data-parallel (series) and task-parallel (parallel) schedules for a joining node
Scheduling a joining node with two incoming arcs: use dynamic programming, considering the two incoming arcs both in series and in parallel. Scheduling a joining node with more than two incoming arcs:
1. Sort the incoming arcs by their makespan when maximally parallelized
2. Schedule two incoming nodes at a time
[Figure: example with arcs A (10), B (1), C (8), and D (5) paired off and combined two at a time]
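A minimal sketch of the two-arc decision, under an assumed cost model (not the paper's exact formulation): each incoming subtree comes with a table of its best makespan on p processors; in series the subtrees run one after the other on all p processors, in parallel the p processors are split between them.

// join_dp.cpp -- hypothetical sketch: series vs. parallel for a joining node
// with two incoming subtrees B and C.
#include <algorithm>
#include <cstdio>
#include <vector>
using std::vector;

int main() {
    const int P = 8;
    // tB[p], tC[p]: assumed best makespan of each subtree on p processors
    // (index 0 unused; made-up numbers -- B scales well, C does not).
    vector<double> tB = {0, 16.0, 8.2, 5.6, 4.3, 3.6, 3.1, 2.8, 2.6};
    vector<double> tC = {0, 6.0, 5.5, 5.3, 5.2, 5.2, 5.2, 5.2, 5.2};
    for (int p = 1; p <= P; ++p) {
        // Series: run B then C, each using all p processors.
        double series = tB[p] + tC[p];
        // Parallel: split the p processors between B and C.
        double parallel = 1e300;
        for (int p1 = 1; p1 < p; ++p1)
            parallel = std::min(parallel, std::max(tB[p1], tC[p - p1]));
        if (p == 1)
            std::printf("p=%d  series=%.1f  best=%.1f\n", p, series, series);
        else
            std::printf("p=%d  series=%.1f  parallel=%.1f  best=%.1f\n",
                        p, series, parallel, std::min(series, parallel));
    }
    return 0;
}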

9 Makespan: achieved vs. estimated
Experiments were run on an 8-node Linux cluster; the executable codes were generated by the ALPS software framework.
[Figure: achieved vs. estimated makespan (sec) against the number of processors used (1, 2, 4, 8), for data-parallelism only [3], task-parallelism preferred [4], and our scheduling algorithm]

10 Schedules found in makespan minimization
Input: target cube (32x2048x32), data cube (32x2048x32), steering vector (32x1x32); processors P1-P8.
- Data-parallelism only [3]: ~31 sec
- Task-parallelism preferred [4]: ~24 sec
- Our scheduling algorithm: ~24 sec

11 Throughput: achieved vs. estimated
[Figure: achieved vs. estimated throughput (1/sec) against the number of processors used (1, 2, 4, 8), for data-parallelism only [3], task-parallelism preferred [4], and our scheduling algorithm]

12 Schedules found in throughput maximization
Input: target cube (32x2048x32), data cube (32x2048x32), steering vector (32x1x32); processors P1-P8.
- Data-parallelism only [3] and our scheduling algorithm: ~77 sec (schedule on 1 processor, replicated 8 times)
- Task-parallelism preferred [4]: ~45 sec (schedule on 4 processors, replicated 2 times)
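Assuming the throughput of a replicated schedule is (number of concurrent replicas) / (schedule length), these figures are consistent with the previous slide: 8 replicas / 77 sec ≈ 0.10/sec for the fully replicated 1-processor schedule, versus 2 replicas / 45 sec ≈ 0.044/sec for the task-parallelism-preferred schedule.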

13 Conclusion
- The schedules found by the ALPS framework are as good as, and often better than, those computed by other published algorithms.
- The relative error between the estimated and the achieved makespan/throughput was around 10%, confirming that the modeled execution times are useful in predicting performance.
- Weighted least-squares produced estimates 20-30% better than ordinary least-squares in terms of relative error; it may be useful when modeling unreliable timings.

14 Related works
Frameworks: [1] PVL and [2] S3P
- We use a text version of the task graph as the front-end and source-code generation as the back-end, so users do not deal with C++ code directly.
- We cover a larger scheduling problem space (tree-structured task graphs) than S3P.
Scheduling algorithms:
- [3] J. Subhlok and G. Vondran: chains of tasks only
- [4] D. Nicol, R. Simha, and A. Choudhary: series-parallel task graphs, but task-parallelism is preferred
(... and many other works; see the references in the abstract)

15 References
[1] E. Rutledge and J. Kepner, "PVL: An Object Oriented Software Library for Parallel Signal Processing," IEEE Cluster 2001.
[2] H. Hoffmann, J. V. Kepner, and R. A. Bond, "S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures," Proceedings of the Fifth Annual High-Performance Embedded Computing (HPEC) Workshop, Nov.
[3] J. Subhlok and G. Vondran, "Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations," Journal of Parallel and Distributed Computing, vol. 60, 2000.
[4] D. Nicol, R. Simha, and A. Choudhary, "Optimal Processor Assignments for a Class of Pipelined Computations," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 4, 1994.

