Parallel processing is not easy

Presentation transcript:

Parallel processing is not easy

Good schedules for parallel implementations of composite tasks are hard to find. Going from the user's idea of a parallel computation (e.g., STAP) to an executable file for a given parallel computer requires a parallel implementation: mapping and scheduling the tasks and optimizing for makespan or throughput.

Parallel processing can be made easy with the ALPS software framework

Starting from the user's idea of a parallel computation (e.g., STAP), the framework proceeds in three stages before code generation:

1. Task Graph: various STAP heuristics constructed from ALPS library building blocks.
2. ALPS Benchmarker & Modeler: models the execution times of the basic building blocks under various configurations.
3. ALPS Scheduler: finds the configuration that minimizes the overall makespan or maximizes throughput.

ALPS CodeGen then produces the implementation:

/* This is a .cpp file of the partially adaptive STAP methods that the user had
   in mind, optimized for minimum latency or maximum throughput on a given
   parallel machine. */
#include <alps.h>
...

1. Task Graph: PRI-overlapped STAP

The user describes the algorithm using task graphs; the ALPS software framework determines the implementation parameters, such as the number of processors for each task, and returns the executable code. For PRI-overlapped STAP, the graph reads the data cube, target cube, and steering vector, splits the data with overlap, applies fft1d and STAP to each piece, and then joins and writes the result.
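As an illustration only, the same pipeline can be written down as a plain adjacency list; the node names follow the slide, but the representation is a generic sketch, not the ALPS task-graph front-end.

// Generic sketch of the PRI-overlapped STAP task graph as an adjacency list.
// Node names follow the slide; the representation is illustrative, not the
// ALPS front-end format.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    // Each task maps to the tasks that consume its output.
    std::map<std::string, std::vector<std::string>> graph = {
        {"read",  {"split"}},   // read data cube, target cube, steering vector
        {"split", {"fft1d"}},   // split with overlap
        {"fft1d", {"stap"}},    // FFT on each piece
        {"stap",  {"join"}},    // partially adaptive STAP on each piece
        {"join",  {"write"}},   // join the pieces
        {"write", {}},          // write the result
    };
    for (const auto& [task, consumers] : graph)
        for (const auto& next : consumers)
            std::printf("%s -> %s\n", task.c_str(), next.c_str());
    return 0;
}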

2. ALPS Benchmarker and Modeler

Actual timings y_i are measured for a set of input parameters. Candidate terms {g_k}, such as the size of the data cube or the size of the processor cube, are generated from the input parameters, and the model coefficients x_k are found by least-squares:

T_task ≈ Σ_k x_k g_k,  i.e.  min_x ||W(Ax − b)||²,

where the columns of A are the candidate terms g_1, …, g_n evaluated at the m measured configurations, b = (y_1, …, y_m) holds the measured timings, and W is a weighting matrix. The problem is solved incrementally, adding the best remaining column to the model one at a time.
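A minimal sketch of that incremental fit, assuming the unweighted case (W = I) and made-up toy data; it is an illustrative re-implementation of the greedy column-selection idea, not ALPS code.

// Greedy forward selection for the execution-time model T_task ≈ Σ x_k g_k:
// columns g_k are added one at a time, keeping whichever most reduces the
// residual ||Ax − y||. Ordinary (unweighted) least-squares via normal equations.
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // column-major: Mat[k] is candidate column g_k

// Solve (A^T A) x = A^T y by Gaussian elimination on the selected columns.
static Vec fit(const Mat& cols, const Vec& y) {
    const size_t n = cols.size(), m = y.size();
    Mat G(n, Vec(n + 1, 0.0));                       // augmented [A^T A | A^T y]
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j)
            for (size_t r = 0; r < m; ++r) G[i][j] += cols[i][r] * cols[j][r];
        for (size_t r = 0; r < m; ++r) G[i][n] += cols[i][r] * y[r];
    }
    for (size_t p = 0; p < n; ++p)                   // naive elimination
        for (size_t i = p + 1; i < n; ++i) {
            double f = G[i][p] / G[p][p];
            for (size_t j = p; j <= n; ++j) G[i][j] -= f * G[p][j];
        }
    Vec x(n, 0.0);
    for (size_t i = n; i-- > 0;) {                   // back substitution
        double s = G[i][n];
        for (size_t j = i + 1; j < n; ++j) s -= G[i][j] * x[j];
        x[i] = s / G[i][i];
    }
    return x;
}

static double residual(const Mat& cols, const Vec& x, const Vec& y) {
    double r2 = 0.0;
    for (size_t r = 0; r < y.size(); ++r) {
        double pred = 0.0;
        for (size_t k = 0; k < cols.size(); ++k) pred += x[k] * cols[k][r];
        r2 += (y[r] - pred) * (y[r] - pred);
    }
    return std::sqrt(r2);
}

int main() {
    // Toy data: m = 4 measured timings, three candidate terms
    // (constant, data-cube size, 1/processors); all values are made up.
    Vec y = {1.2, 2.1, 4.3, 8.2};
    Mat candidates = {{1, 1, 1, 1}, {1, 2, 4, 8}, {1.0, 0.5, 0.25, 0.125}};

    Mat model;                                        // selected columns so far
    std::vector<bool> used(candidates.size(), false);
    for (int step = 0; step < 2; ++step) {            // keep the two best terms
        size_t best = 0;
        double bestRes = 1e300;
        for (size_t k = 0; k < candidates.size(); ++k) {
            if (used[k]) continue;
            Mat trial = model;
            trial.push_back(candidates[k]);
            double res = residual(trial, fit(trial, y), y);
            if (res < bestRes) { bestRes = res; best = k; }
        }
        model.push_back(candidates[best]);
        used[best] = true;
        std::printf("added column %zu, residual %.4f\n", best, bestRes);
    }
    return 0;
}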

Choosing the weighting matrix W in min_x ||W(Ax − b)||²

Ordinary least-squares: W = I.
COV-weighted least-squares (BLUE estimator): W = R^(-1/2), where R is the covariance matrix; one of the best choices, but it requires many replications and hence is not practical.
VAR-weighted least-squares: W = D^(-1/2), where D = diag(R).
INV-weighted least-squares: W = diag(1/b).

We used ordinary, VAR-weighted, and INV-weighted least-squares, with 5 replications for each measurement.
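For reference, the closed-form estimate behind each of these choices is the standard weighted least-squares solution (a textbook identity, not anything ALPS-specific):

\[
\hat{x} \;=\; \arg\min_{x}\, \lVert W(Ax - b)\rVert_2^2 \;=\; \left(A^{\top} W^{\top} W A\right)^{-1} A^{\top} W^{\top} W\, b ,
\]

where A holds the candidate terms g_k as columns and b holds the measured timings y_i; ordinary least-squares is the special case W = I.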

[Figure: relative modeling errors from weighted vs. ordinary least-squares, with 10%, 20%, 30%, and 40% improvement reference lines. Example point: 26.6% error from the ordinary least-squares model vs. 7.5% error from the INV-weighted least-squares model.]

3. ALPS Scheduler

The scheduler handles tree-shaped task graphs: (tree-shaped graphs) = (linear chains) + (joining nodes). Linear chains are rather simple to schedule; joining nodes, where several subtrees (e.g., B and C) feed a single task A, are complicated. For a joining node, the processors-versus-time schedule can use data-parallelism only [3], prefer task-parallelism [4], or take other, possibly better, forms.

We consider both data-parallel (series) and task-parallel (parallel) schedules for a joining node.

Scheduling a joining node with two incoming arcs: use dynamic programming, considering the two incoming arcs both in series and in parallel.

Scheduling a joining node with more than two incoming arcs:
1. Sort the incoming arcs by their makespan when maximally parallelized.
2. Schedule two incoming nodes at a time.

[Figure: example joining node A (weight 10) with incoming subtrees B, C, and D of weights 1, 8, and 5, combined two at a time.]
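A minimal sketch of the series-versus-parallel decision for one joining node with two incoming chains, assuming each chain's makespan on p processors is available as a lookup table with made-up values; this illustrates the idea on the slide, not the ALPS scheduler itself.

// Choose the better of a series (data-parallel) and a parallel (task-parallel)
// schedule for a joining node with two incoming chains B and C.
// T_B[p] and T_C[p] are assumed makespans on p processors (index 0 unused).
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int P = 8;                                  // processors available
    std::vector<double> T_B = {0, 8.0, 4.5, 3.2, 2.6, 2.3, 2.1, 2.0, 1.9};
    std::vector<double> T_C = {0, 20.0, 10.5, 7.4, 5.8, 5.0, 4.4, 4.1, 3.9};

    // Series: run B then C, each data-parallel across all P processors.
    double series = T_B[P] + T_C[P];

    // Parallel: split the processors and run B and C side by side.
    double parallel = 1e300;
    int bestSplit = 1;
    for (int q = 1; q < P; ++q) {
        double t = std::max(T_B[q], T_C[P - q]);
        if (t < parallel) { parallel = t; bestSplit = q; }
    }

    std::printf("series   : %.2f s\n", series);
    std::printf("parallel : %.2f s (B on %d procs, C on %d)\n",
                parallel, bestSplit, P - bestSplit);
    std::printf("chosen   : %s\n", series <= parallel ? "series" : "parallel");
    return 0;
}

In the full scheduler this comparison would be made for every processor count inside a dynamic program over the whole tree, pairing the sorted incoming arcs two at a time as described above.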

Makespan: achieved vs. estimated

Experiments on an 8-node Linux cluster; the executable codes were generated by the ALPS software framework.

[Figure: makespan (sec) versus number of processors used (1, 2, 4, 8), estimated vs. achieved, for data-parallelism only [3], task-parallelism preferred [4], and our scheduling algorithm.]

Schedules found in makespan minimization

Problem size: data cube 32x2048x32, target cube 32x2048x32, steering vector 32x1x32, on processors P1-P8.

[Figure: the resulting schedules. Data-parallelism only [3]: ~31 sec. Task-parallelism preferred [4]: ~24 sec. Our scheduling algorithm: ~24 sec.]

Throughput: achieved vs. estimated

[Figure: throughput (1/sec) versus number of processors used (1, 2, 4, 8), estimated vs. achieved, for data-parallelism only [3], task-parallelism preferred [4], and our scheduling algorithm.]

Schedules found in throughput maximization

Problem size: data cube 32x2048x32, target cube 32x2048x32, steering vector 32x1x32, on processors P1-P8.

[Figure: the resulting schedules. Data-parallelism only [3] and our scheduling algorithm: ~77 sec, the single-processor schedule replicated 8 times. Task-parallelism preferred [4]: ~45 sec, a 4-processor schedule replicated 2 times.]
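As a rough consistency check (assuming the ~77 s and ~45 s figures are the periods of one replica), replication converts these periods into the throughputs shown on the previous throughput slide:

\[
\text{throughput} \approx \frac{\text{replicas}}{\text{period}}:\qquad
\frac{8}{77\ \text{s}} \approx 0.10\ \text{s}^{-1},\qquad
\frac{2}{45\ \text{s}} \approx 0.044\ \text{s}^{-1}.
\]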

Conclusion

The schedules found by the ALPS framework are as good as, and often better than, those computed by other published algorithms. The relative error between the estimated and the achieved makespan/throughput was around 10%, confirming that the modeled execution times are useful for predicting performance. Weighted least-squares produced estimates 20-30% better than ordinary least-squares in terms of relative error, and may be useful when modeling unreliable timings.

Related works

Frameworks: [1] PVL and [2] S3P. We use a text version of the task graph as the front-end and source code generation as the back-end, so users do not have to deal with C++ code directly, and we cover a larger scheduling problem space (tree-structured task graphs) than S3P.

Scheduling algorithms: [3] J. Subhlok and G. Vondran handle chains of tasks only. [4] D. Nicol, R. Simha, and A. Choudhary handle series-parallel task graphs, but task-parallelism is preferred. (… and many other works; see the references in the abstract.)

References

[1] E. Rutledge and J. Kepner, "PVL: An Object Oriented Software Library for Parallel Signal Processing," IEEE Cluster, 2001.
[2] H. Hoffmann, J. V. Kepner, and R. A. Bond, "S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures," in Proceedings of the Fifth Annual High-Performance Embedded Computing (HPEC) Workshop, Nov. 2001.
[3] J. Subhlok and G. Vondran, "Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations," Journal of Parallel and Distributed Computing, vol. 60, 2000.
[4] D. Nicol, R. Simha, and A. Choudhary, "Optimal Processor Assignments for a Class of Pipelined Computations," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 4, 1994.