Parallelizing Incremental Bayesian Segmentation (IBS)
Joseph Hastings, Sid Sen

Outline
Background on IBS
Code Overview
Parallelization Methods (Cilk, MPI)
Cilk Version
MPI Version
Summary of Results
Final Comments

Background on IBS

IBS
Incremental Bayesian Segmentation [1] is an on-line machine learning algorithm designed to segment time-series data into a set of distinct clusters.
It models the time series as the concatenation of processes, each generated by a distinct Markov chain, and attempts to find the most likely break points between the processes.
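
For intuition, the score that drives these decisions is a Bayesian marginal likelihood of a segment under a candidate Markov chain. A generic Dirichlet-multinomial form is sketched below; this is the standard textbook expression for Markov-chain data with Dirichlet priors, not necessarily the exact formula used in [1], whose notation and priors may differ. Here n_ij counts transitions from state i to state j within the segment and alpha_ij are the prior hyperparameters.

\[
P(\text{segment} \mid \text{model})
  = \prod_{i} \frac{\Gamma(\alpha_{i})}{\Gamma(\alpha_{i} + n_{i})}
    \prod_{j} \frac{\Gamma(\alpha_{ij} + n_{ij})}{\Gamma(\alpha_{ij})},
\qquad
\alpha_{i} = \sum_{j} \alpha_{ij}, \quad n_{i} = \sum_{j} n_{ij}.
\]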

Training Process
During the training phase of the algorithm, IBS builds a set of Markov matrices that it believes are most likely to describe the set of processes responsible for generating the time series.

Code Overview

High-Level Control Flow
main() loops through the input file and runs break-point detection.
For each detected segment, check_out_process() is called.
For each existing matrix, compute_subsumed_marginal_likelihood() is called.
The segment is then added to the set of matrices or subsumed into an existing one (see the sketch below).
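
To make the flow concrete, here is a rough C-style sketch of the driver. The names main(), check_out_process(), and compute_subsumed_marginal_likelihood() come from the slides; the types and helpers (process, process_list, read_observation(), detect_break_point(), and so on) are hypothetical placeholders, not the project's actual code.

/* Hypothetical driver sketch; types and helper names are invented. */
int main(int argc, char **argv) {
    process_list *processes = empty_process_list();  /* learned Markov matrices */
    process *proc = new_process();                   /* segment being built */

    while (read_observation(stdin, proc)) {          /* loop through the input file */
        if (detect_break_point(proc)) {              /* break-point detection */
            check_out_process(proc, processes);      /* score against each matrix */
            proc = new_process();                    /* start the next segment */
        }
    }
    return 0;
}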

Parallelizable Computation
compute_subsumed_marginal_likelihood() depends only on a single matrix and the new segment, and it produces a single score.
The index of the best score must then be determined (see the serial sketch below, which is the loop that the Cilk and MPI versions parallelize).
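
In serial form, the scoring step inside check_out_process() is a loop of independent likelihood evaluations followed by an argmax. Because each call touches only one matrix and the new segment, the iterations are the natural unit of parallel work. The names and the simplified two-argument signature below are hypothetical, based only on the slides.

#include <float.h>

/* Serial scoring loop: every iteration is independent, so the calls can be
   spawned (Cilk) or partitioned across ranks (MPI); only the final argmax
   couples them. */
int best_index = -1;
double best_score = -DBL_MAX;
for (int i = 0; i < num_matrices; i++) {
    double score = compute_subsumed_marginal_likelihood(proc, get(processes, i));
    if (score > best_score) {
        best_score = score;
        best_index = i;
    }
}
/* best_index now names the matrix that should subsume the new segment
   (or, if no score is good enough, the segment becomes a new matrix). */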

Code Status
Original code (Java, LISP, Perl)
C++ code
C code

Parallelization Methods

MPI
MPI (Message Passing Interface) is a library facilitating inter-process communication.
It provides useful communication routines, particularly MPI_Allreduce, which simultaneously reduces data across all nodes and broadcasts the result.
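
As a minimal, self-contained illustration of MPI_Allreduce (a toy, not IBS code), the program below sums one value per rank and leaves the total on every rank:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* each rank contributes its own value */
    double total = 0.0;
    /* Reduce across all ranks and broadcast the result in a single call. */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d sees total %.1f\n", rank, total);
    MPI_Finalize();
    return 0;
}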

Cilk
Originally developed by the Supercomputing Technologies Group at the MIT Laboratory for Computer Science.
Cilk is a language for multithreaded parallel programming, based on ANSI C, that is very effective for exploiting highly asynchronous parallelism [3], which can be difficult to express with message-passing interfaces like MPI.

Cilk
You specify the number of worker threads, or "processors", to create when running a Cilk job.
There is no one-to-one mapping of worker threads to physical processors, hence the quotes.
Work-stealing algorithm: when a processor runs out of work, it asks another processor, chosen at random, for work to do.
Cilk's work-stealing scheduler executes any Cilk computation in nearly optimal time: a computation on P processors runs in time T_P ≤ T_1/P + O(T_∞), where T_1 is the total work and T_∞ is the critical-path length.
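
To make the bound concrete, here is an illustrative calculation with made-up numbers (not measurements from this project):

\[
T_1 = 80~\text{s},\quad T_\infty = 1~\text{s},\quad P = 4
\;\Longrightarrow\;
T_4 \le \frac{80~\text{s}}{4} + O(1~\text{s}) \approx 21~\text{s},
\]

so speedup stays nearly linear as long as the available parallelism T_1/T_∞ is much larger than P.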

Code Status
Original code (Java, LISP, Perl)
C++ code
C code
MPI
Cilk

Cilk Version

Code Modifications
Keywords: cilk, spawn, sync.
Convert any methods that will be spawned, or that will spawn other (Cilk) methods, into Cilk methods.
In our case: main(), check_out_process(), compute_subsumed_marginal_likelihood().
The main source of parallelism comes from subsuming the current process with each existing process and choosing the subsumption with the best score (see the sketch below):
spawn compute_subsumed_marginal_likelihood(proc, get(processes, i), copy_process_list(processes));
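
A hedged sketch of what the spawned scoring loop might look like in MIT Cilk-5 syntax. The three-argument call is taken from the spawn line above; the return type, the scores array, and the helper subsume_or_add() are assumptions for illustration, not the project's actual code.

#include <stdlib.h>

cilk double compute_subsumed_marginal_likelihood(process *proc,
                                                 process *existing,
                                                 process_list *all);

cilk void check_out_process(process *proc, process_list *processes) {
    int n = length(processes);
    double *scores = malloc(n * sizeof(double));
    int i, best = 0;

    for (i = 0; i < n; i++)
        /* Each candidate subsumption is scored in its own Cilk thread. */
        scores[i] = spawn compute_subsumed_marginal_likelihood(
                        proc, get(processes, i), copy_process_list(processes));
    sync;  /* wait for every spawned score before choosing the best */

    for (i = 1; i < n; i++)            /* serial argmax over the scores */
        if (scores[i] > scores[best]) best = i;

    subsume_or_add(processes, proc, best, scores[best]);  /* hypothetical helper */
    free(scores);
}

An alternative, used on the next slide, is to fold the argmax into the spawned threads themselves and protect the running best score with a Cilk lock.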

Code Modifications
When updating global_score, we need to enforce mutual exclusion between worker threads:
Cilk_lockvar score_lock;
...
Cilk_lock(score_lock);
...
Cilk_unlock(score_lock);
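
A slightly fuller sketch of the guarded update, assuming a global best score and index. Only score_lock and global_score appear on the slide; the other names, and the placement of Cilk_lock_init(), are assumptions. The lock must be initialized once before any worker uses it.

#include <float.h>

Cilk_lockvar score_lock;            /* protects global_score / global_index */
double global_score = -DBL_MAX;
int global_index = -1;

cilk int main(int argc, char **argv) {
    Cilk_lock_init(score_lock);     /* initialize once, before spawning work */
    /* ... read input, spawn check_out_process(), etc. ... */
    return 0;
}

/* Called by each scoring thread once its score for matrix i is known. */
void record_score(double score, int i) {
    Cilk_lock(score_lock);
    if (score > global_score) {     /* guarded read-modify-write */
        global_score = score;
        global_index = i;
    }
    Cilk_unlock(score_lock);
}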

Cilk Results
Optimal performance was achieved using 2 processors, reflecting the trade-off between the overhead of Cilk and the parallelism available in the program.

Adaptive Parallelism
The real intelligence is in the Cilk runtime system, which handles load balancing, paging, and communication protocols between running worker threads.
Currently, the number of processors on which to run a Cilk job must be specified by hand.
The goal is to eventually make the runtime system adaptively parallel by intelligently determining how many threads/processors to use, with fair and efficient allocation among all running Cilk jobs.
The Cilk Macroscheduler [4] uses the steal rate of a job's worker threads as a measure of its processor desire: if a Cilk job spends a substantial amount of its time stealing, the job has more processors than it desires.

MPI Version

Code Modifications
check_out_process() first broadcasts the segment using MPI_Bcast().
Each process loops over all matrices, but only performs the subsumption computation for matrix i if (i % np == rank).
Each process computes its local best score, and MPI_Allreduce() is used to reduce this information to the globally best score.
Every process thereby learns the index of the best matrix and performs the identical subsumption (see the sketch below).
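
A hedged sketch of this owner-computes pattern. Using MPI_MAXLOC with the MPI_DOUBLE_INT pair type is one standard way for every rank to learn both the best score and the index that produced it; the IBS-side names (proc, processes, compute_subsumed_marginal_likelihood(), subsume()) and the segment serialization are assumptions, not the actual code.

#include <mpi.h>
#include <float.h>

/* Inside check_out_process(): rank 0 holds the newly detected segment. */
MPI_Bcast(seg_buffer, seg_len, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* share segment */

struct { double score; int index; } local_best = { -DBL_MAX, -1 }, global_best;

for (int i = 0; i < num_matrices; i++) {
    if (i % np != rank) continue;                 /* owner-computes partitioning */
    double s = compute_subsumed_marginal_likelihood(proc, get(processes, i));
    if (s > local_best.score) { local_best.score = s; local_best.index = i; }
}

/* Combine: every rank learns the best score and which matrix produced it. */
MPI_Allreduce(&local_best, &global_best, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
              MPI_COMM_WORLD);

/* All ranks now perform the identical subsumption, keeping their matrix
   sets consistent without further communication. */
subsume(get(processes, global_best.index), proc);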

MPI Results
Big improvement from 1 to 2 processors; performance levels off at 3 or more.

Summary of Results

MPI vs. Cilk

Final Comments

MPI vs. Cilk
The MPI version was much more complicated, involved more lines of code, and was much more difficult to debug.
The Cilk version required thinking about mutual exclusion, which MPI avoids.
The Cilk version required few code changes, but was conceptually harder to reason about.

References (Presentation)
[1] Paola Sebastiani and Marco Ramoni. Incremental Bayesian Segmentation of Categorical Temporal Data.
[2] Wenke Lee and Salvatore J. Stolfo. Data Mining Approaches for Intrusion Detection.
[3] Cilk Reference Manual. Supercomputing Technologies Group, MIT Lab for Computer Science. November 9. Available online.
[4] R. D. Blumofe, C. E. Leiserson, and B. Song. Automatic Processor Allocation for Work-Stealing Jobs. (Work in progress.)