Christopher Brown and Kevin Hammond School of Computer Science, University of St. Andrews July 2010 Ever Decreasing Circles: Implementing a skeleton for general Parallel Orbit calculations for use in SymGrid-Par.

A General Orbit Calculation Explore a solution space given: an initial set of values, and a set of generators. Used in computational algebra: symmetry of solutions (chemistry, quantum physics, etc.); Rubik's Cube (permutations). Sequential implementations already exist, but there are concerns about their performance.

The Orbit Calculation Starting set: {1}, with a single generator:

f :: Int -> Int
f x = (x + 1) `mod` 4

Applying f to each element accumulates the set step by step: f 1 = 2, f 2 = 3, f 3 = 0, and f 0 = 1 is already present, so the accumulated orbit is {1, 2, 3, 0}.
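The frames above can be reproduced in plain Haskell. This is our own illustration, not code from the paper; `orbitOf` is a made-up helper that follows a single generator until it revisits a known value:

```haskell
-- Hypothetical helper (not from the paper): accumulate the orbit of a single
-- generator g starting from x0, stopping once the next image is already known.
orbitOf :: Eq a => (a -> a) -> a -> [a]
orbitOf g x0 = go [x0]
  where
    go acc@(x:_)
      | g x `elem` acc = reverse acc      -- nothing new: the orbit is complete
      | otherwise      = go (g x : acc)   -- record the new image and continue

f :: Int -> Int
f x = (x + 1) `mod` 4

-- orbitOf f 1 accumulates 1, 2, 3, 0, exactly as in the frames above.
```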

State-of-the-art A sequential version already exists in GAP, but we need to be able to compute orbits of millions of iterations. A parallel version exists in C, using hash tables, but it is fine-tuned to a very specific problem (direct condensation) and may not be scalable. There is also a new parallel implementation in GAP (Shpectorov): a tuple-based implementation using SCSCP and dedicated hash-table servers. We need a general skeleton that can be used for arbitrary orbits.

SymGrid-Par

The Orbit - Sequential Version To our knowledge, this is the first version ever implemented in Haskell.

orbitMul :: (Ord a, Eq a) => [a -> a]   -- generators
                          -> [a]        -- queue of tasks
                          -> [a]        -- accumulating set of results
                          -> [a]
orbitMul gens []     set = set
orbitMul gens (t:ts) set = orbitMul gens (ts ++ new) set'
  where
    (new, set') = applyGens gens [t] set []
    applyGens   = ...
    img         = ...

The Orbit - Sequential img represents the generator applied to the task (the image of the generator application); we need to check it for membership in the result set.

genimg :: Eq a => (a -> a) -> [a] -> [a] -> ([a], [a])
genimg g (q:queue) set =
    if img `elem` set
      then ([], set)
      else (img : queue,   -- add img to the new task queue
            img : set)     -- add img to the set of results
  where img = g q

The Orbit - Sequential Recurse over the list of generators, passing the result set of each img into the next generator application.

applyGens :: Eq a => [a -> a] -> [a] -> [a] -> [a] -> ([a], [a])
applyGens []     q     s   q' = (q', s)
applyGens (g:gs) queue set q' = applyGens gs queue set' (q' ++ queue')
  where (queue', set') = genimg g queue set
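Filling in the pieces elided on the slides, the sequential orbit can be run end-to-end. The queue pattern in `genimg` is our reconstruction, and the mod-4 generator is the example from the earlier slides:

```haskell
-- Sequential orbit: process the task queue until it is empty, accumulating
-- every distinct image in the result set.
orbitMul :: (Ord a, Eq a) => [a -> a] -> [a] -> [a] -> [a]
orbitMul gens []     set = set
orbitMul gens (t:ts) set = orbitMul gens (ts ++ new) set'
  where (new, set') = applyGens gens [t] set []

-- Apply each generator in turn to the task, threading the result set through.
applyGens :: Eq a => [a -> a] -> [a] -> [a] -> [a] -> ([a], [a])
applyGens []     _     s   q' = (q', s)
applyGens (g:gs) queue set q' = applyGens gs queue set' (q' ++ queue')
  where (queue', set') = genimg g queue set

-- img is the image of the generator on the task; keep it only if it is new.
genimg :: Eq a => (a -> a) -> [a] -> [a] -> ([a], [a])
genimg g (q:queue) set
  | img `elem` set = ([], set)
  | otherwise      = (img : queue, img : set)
  where img = g q
```

For the earlier example, `orbitMul [\x -> (x + 1) `mod` 4] [1] [1]` yields the four-element orbit {0, 1, 2, 3}.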

Parallel Orbit We need a queue to express tasks waiting to be processed, and we need to distribute that queue over the available PEs. We use a task farm (master/worker) approach.

Task Farm (Master/Worker)

Extending to a Parallel Orbit The orbit is not quite a true farm, however: results from workers must be accumulated and checked for duplicates (a set?), and only non-duplicates are released as new tasks. Moreover, we must be sure that the orbit will terminate!

Parallel Orbit

Parallel Orbit Calculation

orbitPar :: ([a -> a] -> [a] -> [[a]]) -> [a -> a] -> [a] -> [a]
orbitPar orbitfun gens init = …
  where
    workerProcs = [ process (concat . Data.List.map (orbitfun gens))   -- process abstraction
                  | n <- [1 .. noPe] ]
    toWorker tasks = unshuffle noPe tasks
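`unshuffle` comes from Eden's auxiliary library; as a sketch of what we assume it does (a round-robin split, reimplemented here in plain Haskell):

```haskell
import Data.List (transpose)

-- Round-robin split of a (possibly infinite) task list into n sub-lists,
-- one per worker process: element i goes to worker (i `mod` n).
unshuffle :: Int -> [a] -> [[a]]
unshuffle n xs = [ every n (drop i xs) | i <- [0 .. n - 1] ]
  where
    every _ []     = []
    every k (y:ys) = y : every k (drop (k - 1) ys)

-- shuffle interleaves the sub-lists back together (the inverse of unshuffle).
shuffle :: [[a]] -> [a]
shuffle = concat . transpose
```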

Parallel Orbit Calculation

orbitPar :: ([a -> a] -> [a] -> [[a]]) -> [a -> a] -> [a] -> [a]
orbitPar orbitfun gens init = …
  where
    -- c is a count of potential tasks still outstanding
    addNewTask set (t:ts) c
      | not (t `member` set) = t : addNewTask (Data.Set.insert t set) ts ((c - 1) + nGens)
      | c <= 1               = []
      | otherwise            = addNewTask set ts (c - 1)
    workerProcs = …
    toWorker tasks = …
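The termination counter can be exercised on its own: run `addNewTask` over a finite, made-up task stream (here `nGens = 1`, i.e. a single generator, is an assumption for the demo) and watch it cut the stream off once no tasks remain outstanding:

```haskell
import qualified Data.Set as Set

nGens :: Int
nGens = 1   -- assumption for this demo: one generator

-- Same logic as the slide: emit unseen tasks, bumping the count by nGens
-- (each emitted task may spawn nGens successors); skip duplicates,
-- decrementing the count; stop when the count drains to zero.
addNewTask :: Ord a => Set.Set a -> [a] -> Int -> [a]
addNewTask _   []     _ = []
addNewTask set (t:ts) c
  | not (t `Set.member` set) = t : addNewTask (Set.insert t set) ts ((c - 1) + nGens)
  | c <= 1                   = []
  | otherwise                = addNewTask set ts (c - 1)
```

With one initial task, `addNewTask Set.empty [1,2,2,3] 1` emits 1 and 2, then stops at the duplicate 2: at that point every emitted task has been matched by a processed one, so the orbit terminates.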

Simple Test Case A test case that gives similar (tunable) granularities and delivers a wide range of result values; the size of the result set is changed via setSize. All tests are seeded with 1.

f1 s n = (fib ((n `mod` 20) + 10) + n) `mod` setSize
f2 s n = (fib ((n `mod` 10) + 20) + n) `mod` setSize
f3 s n = (fib ((n `mod` 19) + 10) + n - 1) `mod` setSize

orbitOnList []       _    = []
orbitOnList (g:gens) list = map g list : orbitOnList gens list
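The test generators can be made runnable by fixing `setSize` and supplying a `fib`; both concrete values below are our assumptions (the paper tunes setSize), and the unused seed argument `s` is dropped:

```haskell
setSize :: Int
setSize = 100   -- assumption: the paper varies this

-- Memoised Fibonacci, so repeated calls stay cheap.
fib :: Int -> Int
fib n = fibs !! n
  where fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

f1, f2, f3 :: Int -> Int
f1 n = (fib ((n `mod` 20) + 10) + n) `mod` setSize
f2 n = (fib ((n `mod` 10) + 20) + n) `mod` setSize
f3 n = (fib ((n `mod` 19) + 10) + n - 1) `mod` setSize
```

Every generator maps back into [0, setSize), so the orbit stays within a finite set.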

Measurement Framework Executed on an 8-core machine running at 2.66 GHz, with 4 GB of RAM. Compiled with GHC -O2. Runtimes are given as an average over 10 runs. Performance of the parallel version is measured against the single-core parallel version.

Farm: Speedup Against Par 1, by setSize

Farm - Trace (64000)

Evaluation of the Task Farm Good for regular and well-balanced tasks. Static round-robin distribution may suffer from load imbalance, since it does not distribute tasks in a request-driven way.

A Workpool Approach Distributes tasks in a request-driven way: when a task completes, its processor is added to the queue of idle processors. Better for irregular and unbalanced tasks; automatically deals with load imbalance. Still limited by the master/worker ratio.
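Eden's workpool relies on a `distribute` function that routes each task to the worker that requested it. A sketch of the semantics we assume (our reimplementation; Eden's real version is carefully lazy so it works on the feedback stream):

```haskell
-- Pair each task with the id of the requesting worker, then give worker n
-- exactly the tasks whose matching request carries id n.
distribute :: Int -> [a] -> [Int] -> [[a]]
distribute np tasks reqs =
  [ [ t | (t, r) <- zip tasks reqs, r == n ] | n <- [1 .. np] ]
```

For example, with requests [1,2,2,1] the tasks "abcd" split as ["ad","bc"]: request-driven rather than round-robin.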

Workpool

Workpool: Speedup Against Par 1, by setSize

Workpool - Trace (64000)

Conclusions Speedup appears almost linear, reaching a factor of 8.29 on 8 cores. The workpool is more efficient and gives better speedups for larger set sizes, though it may incur a slight overhead that is noticeable at small set sizes. The workpool is also better balanced for larger set sizes.

Work in Progress Integrating the orbit skeleton into SymGrid-Par: use GAP to compute the computational algebra, and Haskell to exploit the parallelism. Application to larger problems, e.g. the braid orbit. Developing tool support to aid parallel development, e.g. using refactoring. This is the first of a series of domain-specific parallel skeletons: duplicate elimination, completion algorithm, chain reduction, …

Future Work Complete the SGP integration. Solve some real symbolic computing problems. Tool support for sequential-to-parallel transformations? Implement more parallel skeletons: a parallel nub?

Parallel Orbit Calculation

orbitPar :: ([a -> a] -> [a] -> [[a]]) -> [a -> a] -> [a] -> [a]
orbitPar orbitfun gens init = dat
  where
    newTasks = merge (zipWith (#) workerProcs (toWorker dat))
    dat      = addNewTask empty (init' ++ newTasks) (length init')
    init'    = take noPe (cycle init)

    addNewTask set (t:ts) c
      | not (t `member` set) = t : addNewTask (Data.Set.insert t set) ts c'
      | c <= 1               = []
      | otherwise            = addNewTask set ts (c - 1)
      where c' = (c - 1) + nGens

    workerProcs = [ process (concat . Data.List.map (orbitfun gens)) | n <- [1 .. noPe] ]
    toWorker tasks = unshuffle noPe tasks

Eden A semi-explicit model of parallelism: explicit process creation, implicit thread creation.

(unzip . streamf) :: Num a => [a] -> ([a], [a])

uncurry zip (process (unzip . streamf) # [1 .. 10])
  where streamf args = map worker args
        worker x     = (factorial x, fibonacci x)
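Without an Eden toolchain the data flow of this example can still be checked with a sequential stand-in, where `process` is the identity and `#` is plain application (these definitions, and the concrete `factorial`/`fibonacci`, are our simulation, not Eden's API):

```haskell
-- Sequential stand-ins for Eden's process abstraction and instantiation.
process :: (a -> b) -> (a -> b)
process = id

(#) :: (a -> b) -> a -> b
f # x = f x

factorial :: Integer -> Integer
factorial n = product [1 .. n]

fibonacci :: Integer -> Integer
fibonacci n = fibs !! fromIntegral n
  where fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

streamf :: [Integer] -> [(Integer, Integer)]
streamf = map worker
  where worker x = (factorial x, fibonacci x)

-- The slide's expression: unzip inside the "process", zip the halves outside.
example :: [(Integer, Integer)]
example = uncurry zip (process (unzip . streamf) # [1 .. 10])
```

Sequentially, `example` is simply `streamf [1..10]`; under Eden, the unzip/zip pair splits the two result streams onto separate channels, each evaluated by its own implicitly created thread.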

Questions?

Workpool

orbitPar :: ([a -> a] -> [a] -> [[a]]) -> [a -> a] -> [a] -> [a]
orbitPar orbitfun gens init = dat
  where
    (newReqs, newTasks) = (unzip . merge) (zipWith (#) workerProcs (toWorker dat))
    dat   = addNewTask empty (init' ++ newTasks) (length init')
    init' = take noPe (cycle init)

    addNewTask set (t:ts) c
      | not (t `member` set) = t : addNewTask (Data.Set.insert t set) ts c'
      | c <= 1               = []
      | otherwise            = addNewTask set ts (c - 1)
      where c' = (c - 1) + nGens

    workerProcs = [ process (zip [n, n ..] . concat . Data.List.map (orbitfun gens))
                  | n <- [1 .. noPe] ]
    toWorker tasks = distribute tasks requests
    requests = initialReqs ++ newReqs