A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches

Martin Burtscher¹ and Hassan Rabeti²
¹ Department of Computer Science, Texas State University-San Marcos
² Department of Mathematics, Texas State University-San Marcos

Problem: HPC is Hard to Exploit

- HPC application writers are domain experts
  - They are typically not computer scientists and have little or no formal education in parallel programming
- Parallel programming is difficult and error prone
- Modern HPC systems are complex
  - They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
  - They require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance

Target Area: Iterative Local Searches

- Important application domain
  - Widely used in engineering & real-time environments
- Examples
  - All sorts of random-restart greedy algorithms
  - Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
- ILS properties
  - Iteratively produce better solutions
  - Can exploit large amounts of parallelism
  - Often have an exponential search space

Our Solution: ILCS Framework

- Iterative Local Champion Search (ILCS) framework
  - Supports non-random restart heuristics: genetic algorithms, tabu search, particle swarm optimization, etc.
  - Simplifies the implementation of ILS on parallel systems
- Design goals
  - Ease of use and scalability
- Framework benefits
  - Handles threading, communication, locking, resource allocation, heterogeneity, load balancing, termination decisions, and result recording (checkpointing)

User Interface

- The user writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions (a sketch follows below):

  size_t CPU_Init(int argc, char *argv[]);
  void CPU_Exec(long seed, void const *champion, void *result);
  void CPU_Output(void const *champion);

- See the paper for the GPU interface and sample code
- The framework runs the Exec (map) functions in parallel
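To make the interface concrete, here is a minimal sketch of what the three CPU functions might look like for a toy 1-D minimization problem. The Solution layout, the toy objective, and the use of the POSIX srand48/drand48 generators are illustrative assumptions, not code from the paper; in particular, how ILCS compares a result against the current champion is not shown on this slide.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, value; } Solution;

size_t CPU_Init(int argc, char *argv[])
{
    return sizeof(Solution);  /* tell the framework the size of one solution buffer */
}

void CPU_Exec(long seed, void const *champion, void *result)
{
    Solution const *champ = champion;
    Solution *res = result;
    srand48(seed);                            /* the search is determined by the seed */
    double x = champ->x + (drand48() - 0.5);  /* random local move around the champion */
    res->x = x;
    res->value = (x - 3.0) * (x - 3.0);       /* toy objective: minimize (x - 3)^2 */
}

void CPU_Output(void const *champion)
{
    Solution const *champ = champion;
    printf("best x = %f, f(x) = %f\n", champ->x, champ->value);
}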

Internal Operation: Threading

1. The ILCS master thread starts
2. The master forks a worker per core and a handler per GPU (see the pthread sketch below)
3. CPU workers evaluate seeds and record the local optimum
4. GPU workers evaluate seeds and record the local optimum
5. Handlers launch the GPU code, sleep, and record the result
6. The master sporadically finds the global optimum via MPI, then sleeps
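A minimal sketch of this forking pattern, assuming POSIX threads; evaluate_seeds_on_cpu and drive_gpu are hypothetical placeholders, and the real ILCS internals (MPI champion exchange, checkpointing, termination) are omitted.

#include <pthread.h>
#include <unistd.h>

void evaluate_seeds_on_cpu(long worker_id);  /* hypothetical placeholder */
void drive_gpu(long gpu_id);                 /* hypothetical placeholder */

static void *cpu_worker(void *arg)
{
    evaluate_seeds_on_cpu((long)arg);  /* evaluate seeds, record local opt */
    return NULL;
}

static void *gpu_handler(void *arg)
{
    drive_gpu((long)arg);  /* launch GPU code, sleep, record result */
    return NULL;
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* one worker per core */
    long ngpus = 2;                               /* assumed; query CUDA in practice */
    pthread_t tid[ncores + ngpus];

    for (long i = 0; i < ncores; i++)
        pthread_create(&tid[i], NULL, cpu_worker, (void *)i);
    for (long g = 0; g < ngpus; g++)
        pthread_create(&tid[ncores + g], NULL, gpu_handler, (void *)g);

    /* the master would sporadically exchange champions via MPI here */

    for (long i = 0; i < ncores + ngpus; i++)
        pthread_join(tid[i], NULL);
    return 0;
}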

Internal Operation: Seed Distribution

- Example: 4 nodes, each with 4 cores (a, b, c, d) and 2 GPUs (1, 2)
  - Each node gets a chunk of the 64-bit seed range
  - CPUs process their chunk bottom up; GPUs process it top down (see the sketch below)
- Benefits
  - Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
  - Users can generate other distributions from the seeds; any injective mapping results in no redundant evaluations
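The partitioning can be sketched as follows; the even split and the meeting-in-the-middle bookkeeping are assumptions about details the slide does not spell out.

#include <stdint.h>

/* Chunk boundaries for node `rank` of `nodes` (hypothetical helper). */
static void node_chunk(int rank, int nodes, uint64_t *lo, uint64_t *hi)
{
    uint64_t chunk = UINT64_MAX / (uint64_t)nodes;
    *lo = (uint64_t)rank * chunk;
    *hi = (rank == nodes - 1) ? UINT64_MAX : *lo + chunk - 1;
}

/* CPUs draw seeds from the bottom of the chunk, GPUs from the top,
 * so no seed is ever handed out twice. */
static uint64_t next_cpu_seed(uint64_t lo, uint64_t *issued) { return lo + (*issued)++; }
static uint64_t next_gpu_seed(uint64_t hi, uint64_t *issued) { return hi - (*issued)++; }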

Related Work

- MapReduce/Hadoop/MARS and PADO
  - Their generality and features that ILS does not need incur overhead and steepen the learning curve
  - Some do not support accelerators; some require Java
- The ILCS framework is optimized for ILS applications
  - It provides reduction, does not require multiple keys, does not need secondary storage to buffer data, directly supports non-random restart heuristics, allows early termination, works with GPUs and MICs, and targets everything from single-node workstations to HPC clusters

Evaluation Methodology

- Three HPC systems (at TACC and NICS)
- Largest tested configuration
- [System-configuration table not included in the transcript]

Sample ILS Codes

- Traveling Salesman Problem (TSP)
  - Find the shortest tour
  - 4 inputs from TSPLIB
  - 2-opt hill climbing (see the sketch below)
- Finite State Machine (FSM)
  - Find the best FSM configuration to predict hit/miss events
  - 4 sizes (n = 3, 4, 5, 6)
  - Monte Carlo method
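For illustration, here is a sketch of one 2-opt improvement pass of the kind the TSP code iterates; dist() and the array-based tour representation are assumptions, not the authors' implementation.

#include <stdbool.h>

double dist(int a, int b);  /* hypothetical city-distance function */

/* One pass over all 2-opt moves: reversing tour[i+1..j] replaces edges
 * (a,b) and (c,d) with (a,c) and (b,d); the move is applied whenever it
 * shortens the tour. Returns true if any improvement was found. */
static bool two_opt_pass(int *tour, int n)
{
    bool improved = false;
    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 2; j < n; j++) {
            int a = tour[i], b = tour[i + 1];
            int c = tour[j], d = tour[(j + 1) % n];
            if (a == d) continue;  /* the two edges share a city */
            double delta = dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d);
            if (delta < -1e-9) {   /* improving move: reverse the segment */
                for (int lo = i + 1, hi = j; lo < hi; lo++, hi--) {
                    int tmp = tour[lo]; tour[lo] = tour[hi]; tour[hi] = tmp;
                }
                improved = true;
            }
        }
    }
    return improved;
}

A hill climber would call two_opt_pass repeatedly until it returns false, i.e., until the tour is locally optimal.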

FSM Transitions/Second Evaluated

- Peak rate: 21,532,197,798,304 transitions per second
- GPU shared-memory limit
- Ranger uses twice as many cores as Stampede

TSP Tour-Changes/Second Evaluated

- Peak rate: 12,239,050,704,370 tour changes per second (based on the serial CPU code)
- The CPU code pre-computes distances (O(n²) memory); the GPU code re-computes them (O(n) memory), as sketched below
- Each core evaluates a tour change every 3.6 cycles
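The memory trade-off can be illustrated as follows; the array names and the MAX_CITIES bound are hypothetical, and the GPU variant is shown in plain C rather than CUDA.

#include <math.h>

#define MAX_CITIES 1024  /* assumed bound for this illustration */

static float dmat[MAX_CITIES][MAX_CITIES];    /* O(n^2) memory: CPU variant */
static float cx[MAX_CITIES], cy[MAX_CITIES];  /* O(n) memory:   GPU variant */

/* CPU: one table lookup per distance query. */
static inline float dist_cpu(int a, int b)
{
    return dmat[a][b];
}

/* GPU style: recompute the distance from coordinates each time,
 * trading arithmetic for memory. */
static inline float dist_gpu(int a, int b)
{
    float dx = cx[a] - cx[b], dy = cy[a] - cy[b];
    return sqrtf(dx * dx + dy * dy);
}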

TSP Moves/Second/Node Evaluated

- GPUs provide >90% of the performance on Keeneland

ILCS Scaling on Ranger (FSM)

- >99% parallel efficiency on 2048 nodes
- The other two systems show similar behavior

ILCS Scaling on Ranger (TSP)

- >95% parallel efficiency on 2048 nodes
- Longer runs scale even better

Intra-Node Scaling on Stampede (TSP)

- >98.9% parallel efficiency on 16 threads
- The framework overhead is very small

Tour Quality Evolution (Keeneland)

- Quality depends on chance: ILS provides a good solution quickly, then progressively improves it

Tour Quality after 6 Steps (Stampede)

- Larger node counts typically yield better results faster

Summary and Conclusions

- ILCS framework
  - Automatic parallelization of iterative local searches
  - Provides MPI, OpenMP, and multi-GPU support
  - Checkpoints the currently best solution every few seconds
  - Scales very well (decentralized design)
- Evaluation
  - 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
  - AMD and Intel CPUs, NVIDIA GPUs, and Intel MICs
- The ILCS source code is freely available
- This work was supported by NSF, NVIDIA, and Intel; resources were provided by TACC and NICS